Prompt Caching + Model Routing: How to Cut Your AI API Bill by 90%
Most teams overspend on AI by 5-10x. Not because they picked the wrong model — but because every request goes to the same expensive model, and every request re-sends the same context.
Prompt caching eliminates the repeated input cost. Model routing sends easy queries to cheap models. Stacked together, they can cut your bill by 80-90%. This guide includes complete, runnable code for both techniques.
90% — maximum savings from combining prompt caching, model routing, and a batch API.
| Technique | How it works | Typical savings |
|---|---|---|
| Prompt caching | Reuses the cached system prompt instead of reprocessing its tokens | 50-90% on input tokens |
| Model routing | Sends easy queries to cheap models, hard ones to frontier models | 60-70% of total spend |
Part 1: Prompt Caching in Depth
How it works
Without a cache, every API call reprocesses the full system prompt. A 2,000-token prompt times 10,000 requests per day = 20 million input tokens processed from scratch daily.
With a cache, the provider stores the processed prompt. Subsequent requests hit the cache and pay only a fraction of the price:
Without caching:
Request 1: [System: 2000 tok] + [User: 200 tok] → 2200 input tokens billed
Request 2: [System: 2000 tok] + [User: 150 tok] → 2150 input tokens billed
Total: 4,350 tokens at full price
With caching:
Request 1: [System: 2000 tok → WRITE CACHE] + [User: 200 tok] → 2200 at full price
Request 2: [System: CACHE HIT] + [User: 150 tok] → 150 full-price + 2000 cached (90% off)
Total: 2,350 full-price + 2,000 cached tokens
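The token accounting above can be verified in a few lines (counts taken straight from the example):

```python
# Two-request example: 2,000-token system prompt, 200- and 150-token user turns.
SYSTEM_TOK = 2000
USER_TOKS = [200, 150]

# Without caching, every request bills the full system prompt again.
no_cache = sum(SYSTEM_TOK + u for u in USER_TOKS)

# With caching: request 1 writes the cache (full price),
# request 2 bills only its user tokens plus the cached prompt at ~10% price.
full_price = SYSTEM_TOK + USER_TOKS[0] + USER_TOKS[1]
cached = SYSTEM_TOK

print(no_cache)             # 4350 tokens at full price
print(full_price, cached)   # 2350 full-price + 2000 cached
```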
Provider comparison
Frontier model pricing (before caching)
| Model | Input $/1M | Output $/1M | Cached $/1M | Context |
|---|---|---|---|---|
| gpt-5.4 (OpenAI) | $2.50 | $15.00 | $0.250 | 1.1M |
| gpt-5 (OpenAI) | $1.25 | $10.00 | $0.125 | 272K |
| claude-opus-4-6 (Anthropic) | $5.00 | $25.00 | $0.500 | 1M |
| claude-sonnet-4-6 (Anthropic) | $3.00 | $15.00 | $0.300 | 200K |
| gemini-3.1-pro-preview (Google) | $2.00 | $12.00 | $0.200 | 1.0M |
| gemini-2.5-pro-preview-05-06 (Google) | $1.25 | $10.00 | $0.125 | 1.0M |
| deepseek-chat (DeepSeek) | $0.280 | $0.420 | $0.028 | 131.1K |
Live pricing from TokenTab database. Prices may change — last synced from provider APIs.
| Provider | Cache discount | TTL | Activation |
|---|---|---|---|
| Anthropic | 90% off input price | 5 minutes (ephemeral) | Manual — cache_control parameter |
| OpenAI | 50% off input price | Automatic | Automatic — no code changes |
| Google | 90% off input price | Configurable | Manual — cached_content API |
| DeepSeek | 90% off input price | Automatic | Automatic — prefix matching |
Anthropic implementation
Anthropic offers the largest discount (90%) but requires explicit cache markers. The 5-minute TTL resets on every hit — ideal for high-traffic applications.
import anthropic
client = anthropic.Anthropic()
SYSTEM_PROMPT = """You are a senior code reviewer for a Python codebase.
Review code for: security vulnerabilities, performance issues,
readability problems, and adherence to PEP 8.
Always provide specific line references and suggested fixes.
Rate severity as: critical, warning, or info.
... (imagine 1500+ tokens of detailed instructions here)
"""
def review_code(code_snippet: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # enables caching
        }],
        messages=[
            {"role": "user", "content": f"Review this code:\n```python\n{code_snippet}\n```"}
        ]
    )
    usage = response.usage
    print(f"Input: {usage.input_tokens} | Cache read: {usage.cache_read_input_tokens} | Cache write: {usage.cache_creation_input_tokens}")
    return response.content[0].text
# First call: cache write
result = review_code("def add(a, b): return a + b")
# Input: 1700 | Cache read: 0 | Cache write: 1500
# Second call within 5 min: cache hit — 90% cheaper on cached tokens
result = review_code("def multiply(x, y): return x * y")
# Input: 200 | Cache read: 1500 | Cache write: 0
How Anthropic's cache TTL resets
Every cache hit resets the 5-minute TTL. If your app handles at least one request every 5 minutes, the cache stays warm indefinitely. For batch processing, order requests to maximize cache hits within the TTL window.
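One way to do that ordering — a minimal sketch, assuming each pending request carries its system prompt under a hypothetical `system` key — is to sort the batch so requests sharing a prefix run back-to-back:

```python
def order_for_cache_hits(requests: list[dict]) -> list[dict]:
    """Group requests that share a system prompt so each prompt's
    5-minute cache window stays hot for the whole run."""
    return sorted(requests, key=lambda r: r["system"])

batch = [
    {"system": "reviewer", "user": "q1"},
    {"system": "summarizer", "user": "q2"},
    {"system": "reviewer", "user": "q3"},
]
print([r["user"] for r in order_for_cache_hits(batch)])  # ['q1', 'q3', 'q2']
```

Python's sort is stable, so requests keep their original order within each prompt group.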
OpenAI implementation (automatic)
OpenAI caches automatically for prompts over 1,024 tokens. No code changes needed — just verify it's working:
from openai import OpenAI
client = OpenAI()
def query_openai(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message}
        ]
    )
    cached = getattr(response.usage, "prompt_tokens_details", None)
    if cached:
        print(f"Cached tokens: {cached.cached_tokens}")  # > 0 = cache hit
    return response.choices[0].message.content
DeepSeek (automatic prefix caching)
DeepSeek provides automatic prefix caching at a 90% discount via its disk-based cache. Keep the system prompt identical across calls and DeepSeek handles the rest:
client = OpenAI(api_key="your-deepseek-key", base_url="https://api.deepseek.com")
def query_deepseek(user_message: str) -> str:
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message}
        ]
    )
    print(f"Cache hit tokens: {getattr(response.usage, 'prompt_cache_hit_tokens', 0)}")
    return response.choices[0].message.content
Real-world savings
Prompt cache savings calculation
Scenario: 10,000 requests/day, a 2,000-token system prompt, 200 average user input tokens, 500 average output tokens.
Without caching (Claude Sonnet): 22M input tokens/day x $3/M = $66/day.
With caching (95% hit rate): the 5% of requests that miss pay full price for all 2,200 tokens; hits pay full price only for their 200 user tokens plus the cached rate of $0.30/M on the 2,000 system tokens. That is 3M full-price tokens ($9.00) + 19M cached tokens ($5.70) = $14.70/day, ignoring Anthropic's 25% cache-write surcharge on misses.
Savings: $51.30/day ≈ $1,539/month (a 78% cut in input cost).
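The daily figures are easy to reproduce. A sketch of the arithmetic — note it bills the 5% of requests that miss the cache at full price for the entire prompt, and ignores Anthropic's ~25% cache-write surcharge:

```python
REQS = 10_000
SYS_TOK, USER_TOK = 2000, 200
PRICE, CACHED = 3.00, 0.30   # $ per 1M input tokens (claude-sonnet-4-6)
HIT_RATE = 0.95

# No cache: every token is billed at the full input price.
no_cache = REQS * (SYS_TOK + USER_TOK) / 1e6 * PRICE

# With cache: misses pay full price for everything; hits pay full price
# only for user tokens, cached price for the system prompt.
hits = REQS * HIT_RATE
misses = REQS - hits
full_tok = misses * (SYS_TOK + USER_TOK) + hits * USER_TOK   # 3M tokens
cached_tok = hits * SYS_TOK                                   # 19M tokens
with_cache = full_tok / 1e6 * PRICE + cached_tok / 1e6 * CACHED

print(f"${no_cache:.2f}/day vs ${with_cache:.2f}/day")  # $66.00/day vs $14.70/day
```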
Part 2: Model Routing in Depth
Why most queries don't need a frontier model
Roughly 70% of typical AI API traffic is simple work — classification, extraction, format conversion, basic Q&A. Sending it to GPT-5 or Claude Opus is like hiring a PhD to sort mail.
Price spread: frontier vs lightweight models
| Model | Input $/1M | Output $/1M | Cached $/1M | Context |
|---|---|---|---|---|
| claude-opus-4-6 (Anthropic) | $5.00 | $25.00 | $0.500 | 1M |
| gpt-5.4 (OpenAI) | $2.50 | $15.00 | $0.250 | 1.1M |
| claude-sonnet-4-6 (Anthropic) | $3.00 | $15.00 | $0.300 | 200K |
| gemini-3.1-pro-preview (Google) | $2.00 | $12.00 | $0.200 | 1.0M |
| gpt-5 (OpenAI) | $1.25 | $10.00 | $0.125 | 272K |
| claude-haiku-4-5-20251001 (Anthropic) | $1.00 | $5.00 | $0.100 | 200K |
| gpt-5-mini (OpenAI) | $0.250 | $2.00 | $0.025 | 272K |
| gpt-5-nano (OpenAI) | $0.050 | $0.400 | $0.0050 | 272K |
| deepseek-chat (DeepSeek) | $0.280 | $0.420 | $0.028 | 131.1K |
| grok-4-1-fast (xAI) | $0.200 | $0.500 | $0.050 | 2M |
Building a model router
This router classifies query complexity and sends each request to the matching tier:
import anthropic
from openai import OpenAI
from dataclasses import dataclass
from enum import Enum
class Tier(Enum):
    NANO = "nano"          # Classification, extraction
    MID = "mid"            # Summarization, Q&A
    FRONTIER = "frontier"  # Reasoning, code gen, analysis

@dataclass
class ModelConfig:
    provider: str
    model: str
    cost_per_1k_input: float
    cost_per_1k_output: float

MODEL_TIERS: dict[Tier, ModelConfig] = {
    Tier.NANO: ModelConfig("openai", "gpt-5-nano", 0.00005, 0.00040),
    Tier.MID: ModelConfig("deepseek", "deepseek-chat", 0.00028, 0.00042),
    Tier.FRONTIER: ModelConfig("anthropic", "claude-sonnet-4-6", 0.003, 0.015),
}

COMPLEXITY_KEYWORDS = {
    "high": ["analyze", "compare", "debug", "refactor", "architect",
             "design", "optimize", "explain why", "trade-off", "reason"],
    "low": ["classify", "extract", "format", "convert", "translate",
            "summarize briefly", "yes or no", "list the", "parse"],
}

def classify_complexity(query: str) -> Tier:
    query_lower = query.lower()
    word_count = len(query.split())
    high = sum(1 for kw in COMPLEXITY_KEYWORDS["high"] if kw in query_lower)
    low = sum(1 for kw in COMPLEXITY_KEYWORDS["low"] if kw in query_lower)
    if high >= 2 or (word_count > 200 and high >= 1):
        return Tier.FRONTIER
    if low >= 1 and word_count < 50:
        return Tier.NANO
    return Tier.MID

# Provider clients
clients = {
    "anthropic": anthropic.Anthropic(),
    "openai": OpenAI(),
    "deepseek": OpenAI(api_key="deepseek-key", base_url="https://api.deepseek.com"),
}

def route_and_query(query: str, system_prompt: str = "") -> dict:
    tier = classify_complexity(query)
    config = MODEL_TIERS[tier]
    if config.provider == "anthropic":
        resp = clients["anthropic"].messages.create(
            model=config.model, max_tokens=1024,
            system=[{"type": "text", "text": system_prompt,
                     "cache_control": {"type": "ephemeral"}}] if system_prompt else [],
            messages=[{"role": "user", "content": query}]
        )
        text, inp, out = resp.content[0].text, resp.usage.input_tokens, resp.usage.output_tokens
    else:
        resp = clients[config.provider].chat.completions.create(
            model=config.model,
            messages=[*([{"role": "system", "content": system_prompt}] if system_prompt else []),
                      {"role": "user", "content": query}]
        )
        text, inp, out = resp.choices[0].message.content, resp.usage.prompt_tokens, resp.usage.completion_tokens
    cost = inp / 1000 * config.cost_per_1k_input + out / 1000 * config.cost_per_1k_output
    return {"tier": tier.value, "model": config.model, "response": text, "cost": cost}
# Simple extraction → nano ($0.00005/1K input tokens)
result = route_and_query("Extract all email addresses from this text: ...")
# Routed to: nano (gpt-5-nano) — Cost: $0.000024
# Complex reasoning → frontier
result = route_and_query("Analyze the trade-offs between microservices and monolith and design an architecture...")
# Routed to: frontier (claude-sonnet-4-6) — Cost: $0.018500
The Cost-per-Success framework
Raw per-token price is misleading. A cheap model that fails 40% of the time costs more in practice than an expensive model that succeeds every time. Use cost per success (CPS):
CPS = total_cost / successful_outputs
from dataclasses import dataclass, field
@dataclass
class CostPerSuccessTracker:
    results: dict = field(default_factory=lambda: {
        "nano": {"cost": 0.0, "success": 0, "total": 0},
        "mid": {"cost": 0.0, "success": 0, "total": 0},
        "frontier": {"cost": 0.0, "success": 0, "total": 0},
    })

    def record(self, tier: str, cost: float, success: bool):
        self.results[tier]["cost"] += cost
        self.results[tier]["total"] += 1
        if success:
            self.results[tier]["success"] += 1

    def cps(self, tier: str) -> float:
        r = self.results[tier]
        return r["cost"] / r["success"] if r["success"] > 0 else float("inf")

    def report(self):
        for tier, r in self.results.items():
            rate = r["success"] / r["total"] * 100 if r["total"] else 0
            print(f"{tier:<10} {r['total']:>5} reqs | {rate:.0f}% success | CPS: ${self.cps(tier):.6f}")
Results after running 1,000 mixed queries:
| Tier | Queries | Success rate | Total cost | CPS |
|---|---|---|---|---|
| Nano | 450 | 94% | $0.018 | $0.000043 |
| Mid | 380 | 97% | $0.095 | $0.000258 |
| Frontier | 170 | 99% | $2.856 | $0.016970 |
| All frontier (no routing) | 1,000 | 99% | $16.80 | $0.016970 |
Routing savings
Total with routing: $2.97. All frontier: $16.80. Savings: 82%. The nano tier's CPS is roughly 400x lower than the frontier tier's — for simple tasks, the cheap model is efficient enough.
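That ratio can be recomputed from the table's raw columns; working from total cost and success counts (rather than the rounded CPS column) puts it near 399x:

```python
# Cost per success = total cost / successful outputs, per tier.
nano_cps = 0.018 / (450 * 0.94)        # nano tier from the table
frontier_cps = 2.856 / (170 * 0.99)    # frontier tier, same formula

print(f"{frontier_cps / nano_cps:.0f}x")  # 399x
```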
(Chart: model routing — cost per 1K requests, same workload routed vs all-frontier. Cheapest tier, gpt-5-nano, saves $295.65/mo vs claude-opus-4-6.)
Part 3: Stacking Both Techniques
Routing alone saves 70%. Caching alone saves 75%. Combined, the effects compound:
| Optimization | Monthly cost | Savings |
|---|---|---|
| Baseline (all frontier, no cache) | $5,040 | — |
| + prompt caching only | $1,260 | 75% |
| + model routing only | $1,512 | 70% |
| + both combined | $504 | 90% |
$4,536/month saved — 10,000 requests/day with caching + routing.
The implementation is straightforward: use the router from Part 2 and add cache_control to every Anthropic call (route_and_query already does this). OpenAI and DeepSeek handle caching automatically.
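A quick consistency check on the table: caching strips about 67% off the already-routed bill — somewhat less than the 75% it takes off an all-frontier bill, because routed traffic spends proportionally less on cacheable input tokens — and that compounds to the 90% total:

```python
baseline = 5040.0                      # all frontier, no cache ($/mo)
route_only = baseline * (1 - 0.70)     # $1,512 after routing alone
combined = 504.0                       # routing + caching, from the table

print(f"cache cut on routed traffic: {1 - combined / route_only:.0%}")  # 67%
print(f"total cut vs baseline:       {1 - combined / baseline:.0%}")    # 90%
```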
Part 4: Batch APIs for Offline Work
Not everything needs a real-time response. Batch APIs offer a 50% discount for asynchronous processing:
from openai import OpenAI
import json
client = OpenAI()
def submit_batch(queries: list[str], system_prompt: str) -> str:
    # Build JSONL batch file
    requests = [
        {"custom_id": f"req-{i}", "method": "POST", "url": "/v1/chat/completions",
         "body": {"model": "gpt-5-mini", "max_tokens": 512,
                  "messages": [{"role": "system", "content": system_prompt},
                               {"role": "user", "content": q}]}}
        for i, q in enumerate(queries)
    ]
    with open("/tmp/batch.jsonl", "w") as f:
        for r in requests:
            f.write(json.dumps(r) + "\n")
    batch_file = client.files.create(file=open("/tmp/batch.jsonl", "rb"), purpose="batch")
    job = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h"
    )
    print(f"Batch {job.id} submitted — 50% cheaper, results within 24h")
    return job.id

def get_results(batch_id: str) -> list[dict] | None:
    batch = client.batches.retrieve(batch_id)
    if batch.status == "completed":
        content = client.files.content(batch.output_file_id)
        return [json.loads(line) for line in content.text.strip().split("\n")]
    print(f"Status: {batch.status}")
    return None
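A small polling helper rounds this out — a generic sketch that takes any fetch callable (such as a lambda wrapping the get_results function above) and retries until it returns a result:

```python
import time

def wait_for(fetch, interval_s: float = 60, timeout_s: float = 24 * 3600):
    """Call `fetch` until it returns a non-None result or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = fetch()
        if result is not None:
            return result
        time.sleep(interval_s)
    raise TimeoutError("batch did not complete within the window")

# Demo with a stub that succeeds on the third poll:
attempts = iter([None, None, ["done"]])
print(wait_for(lambda: next(attempts), interval_s=0))  # ['done']
```

In production you would call wait_for(lambda: get_results(job_id)) with the default 60-second interval.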
When to use a Batch API
Batches suit bulk content generation, dataset labeling, nightly reports, and embedding generation — any workload that can wait up to 24 hours. The 50% discount stacks with routing for even deeper savings.
Part 5: Cost Tracking — Prove Your Savings
import json
from datetime import datetime
from collections import defaultdict
class CostTracker:
    def __init__(self):
        self.records = defaultdict(lambda: {
            "requests": 0, "cost": 0.0, "cache_savings": 0.0, "routing_savings": 0.0
        })

    def record(self, model: str, cost: float, cache_savings: float = 0, routing_savings: float = 0):
        key = f"{datetime.now():%Y-%m-%d}:{model}"
        self.records[key]["requests"] += 1
        self.records[key]["cost"] += cost
        self.records[key]["cache_savings"] += cache_savings
        self.records[key]["routing_savings"] += routing_savings

    def summary(self) -> dict:
        total_cost = sum(v["cost"] for v in self.records.values())
        saved_cache = sum(v["cache_savings"] for v in self.records.values())
        saved_route = sum(v["routing_savings"] for v in self.records.values())
        reqs = sum(v["requests"] for v in self.records.values())
        baseline = total_cost + saved_cache + saved_route
        return {
            "total_cost": round(total_cost, 2),
            "total_savings": round(saved_cache + saved_route, 2),
            "effective_discount": f"{(saved_cache + saved_route) / max(baseline, 0.01) * 100:.1f}%",
            "total_requests": reqs,
            "avg_cost_per_request": round(total_cost / max(reqs, 1), 6),
        }
tracker = CostTracker()
# After a day of traffic:
# {"total_cost": 15.42, "total_savings": 128.76, "effective_discount": "89.3%", ...}
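The effective_discount figure in that sample output lines up with the formula in summary():

```python
total_cost, total_savings = 15.42, 128.76    # from the sample day above
baseline = total_cost + total_savings        # what an unoptimized day would cost

print(f"{total_savings / baseline:.1%}")  # 89.3%
```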
Cheat sheet
| Step | Action | Expected savings |
|---|---|---|
| 1 | Add cache_control to Anthropic system prompts | 50-90% on input tokens |
| 2 | Verify OpenAI automatic caching (cached_tokens in responses) | 50% on input tokens |
| 3 | Build a 3-tier model router | 60-70% of total spend |
| 4 | Move batch workloads to a Batch API | 50% on batch jobs |
| 5 | Add cost tracking to prove ROI | Visibility |
Sources
- Anthropic — Prompt Caching docs — 90% discount, 5-min TTL
- OpenAI — Prompt Caching guide — Automatic, 50% discount
- DeepSeek — KV Cache docs — Automatic prefix caching, 90% discount
- Google — Context Caching for Gemini — Configurable TTL, 90% discount
- OpenAI — Batch API reference — 50% discount, 24h window
- Anthropic — Message Batches API — 50% discount, 24h window
- TokenTab — Live Model Pricing — Real-time pricing for 1,800+ models