Prompt Caching + Model Routing: How to Cut Your AI API Bill by 90%
Most teams overspend on AI by 5-10x. Not because they picked the wrong model — but because every request goes to the same expensive model, and every request re-sends the same context.
Prompt caching eliminates the repeated input cost. Model routing sends easy queries to cheap models. Stacked together, they can cut your bill by 80-90%. This guide includes complete, runnable code for both techniques.
90% — maximum savings from combining prompt caching, model routing, and a batch API.
| Technique | How it works | Typical savings |
|---|---|---|
| Prompt caching | Reuses the cached system prompt instead of reprocessing its tokens | 50-90% on input tokens |
| Model routing | Sends easy queries to cheap models, hard ones to frontier models | 60-70% of total spend |
Part 1: Prompt Caching in Depth
How it works
Without a cache, every API call reprocesses the full system prompt. A 2,000-token prompt times 10,000 requests per day = 20 million input tokens processed from scratch daily.
With a cache, the provider stores the processed prompt. Subsequent requests hit the cache and pay only a fraction of the price:
Without caching:
Request 1: [System: 2000 tok] + [User: 200 tok] → 2200 input tokens billed
Request 2: [System: 2000 tok] + [User: 150 tok] → 2150 input tokens billed
Total: 4,350 tokens at full price
With caching:
Request 1: [System: 2000 tok → WRITE CACHE] + [User: 200 tok] → 2200 at full price
Request 2: [System: CACHE HIT] + [User: 150 tok] → 150 full-price + 2000 cached (90% off)
Total: 2,350 full-price + 2,000 cached tokens
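The token accounting above can be verified in a few lines (counts taken straight from the example):

```python
# Two-request example: 2,000-token system prompt, 200- and 150-token user turns.
SYSTEM_TOK = 2000
USER_TOKS = [200, 150]

# Without caching, every request bills the full system prompt again.
no_cache = sum(SYSTEM_TOK + u for u in USER_TOKS)

# With caching: request 1 writes the cache (full price),
# request 2 bills only its user tokens plus the cached prompt at ~10% price.
full_price = SYSTEM_TOK + USER_TOKS[0] + USER_TOKS[1]
cached = SYSTEM_TOK

print(no_cache)             # 4350 tokens at full price
print(full_price, cached)   # 2350 full-price + 2000 cached
```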
Provider comparison
Frontier model pricing (before caching)
| Model | Input $/1M | Output $/1M | Cached $/1M | Context |
|---|---|---|---|---|
| gpt-5.4 (OpenAI) | $2.50 | $15.00 | $0.250 | 1.1M |
| gpt-5 (OpenAI) | $1.25 | $10.00 | $0.125 | 272K |
| claude-opus-4-6 (Anthropic) | $5.00 | $25.00 | $0.500 | 1M |
| claude-sonnet-4-6 (Anthropic) | $3.00 | $15.00 | $0.300 | 200K |
| gemini-3.1-pro-preview (Google) | $2.00 | $12.00 | $0.200 | 1.0M |
| gemini-2.5-pro-preview-05-06 (Google) | $1.25 | $10.00 | $0.125 | 1.0M |
| deepseek-chat (DeepSeek) | $0.280 | $0.420 | $0.028 | 131.1K |
Live pricing from TokenTab database. Prices may change — last synced from provider APIs.
| Provider | Cache discount | TTL | Activation |
|---|---|---|---|
| Anthropic | 90% off input price | 5 minutes (ephemeral) | Manual — cache_control parameter |
| OpenAI | 50% off input price | Automatic | Automatic — no code changes |
| Google | 90% off input price | Configurable | Manual — cached_content API |
| DeepSeek | 90% off input price | Automatic | Automatic — prefix matching |
Anthropic implementation
Anthropic offers the largest discount (90%) but requires explicit cache markers. The 5-minute TTL resets on every hit — ideal for high-traffic applications.
import anthropic
client = anthropic.Anthropic()
SYSTEM_PROMPT = """You are a senior code reviewer for a Python codebase.
Review code for: security vulnerabilities, performance issues,
readability problems, and adherence to PEP 8.
Always provide specific line references and suggested fixes.
Rate severity as: critical, warning, or info.
... (imagine 1500+ tokens of detailed instructions here)
"""
def review_code(code_snippet: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # enables caching
        }],
        messages=[
            {"role": "user", "content": f"Review this code:\n```python\n{code_snippet}\n```"}
        ]
    )
    usage = response.usage
    print(f"Input: {usage.input_tokens} | Cache read: {usage.cache_read_input_tokens} | Cache write: {usage.cache_creation_input_tokens}")
    return response.content[0].text
# First call: cache write
result = review_code("def add(a, b): return a + b")
# Input: 1700 | Cache read: 0 | Cache write: 1500
# Second call within 5 min: cache hit — 90% cheaper on cached tokens
result = review_code("def multiply(x, y): return x * y")
# Input: 200 | Cache read: 1500 | Cache write: 0
How Anthropic's cache TTL resets
Every cache hit resets the 5-minute TTL. If your app handles at least one request every 5 minutes, the cache stays warm indefinitely. For batch processing, order requests to maximize cache hits within the TTL window.
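One way to do that ordering — a minimal sketch, assuming each pending request carries its system prompt under a hypothetical `system` key — is to sort the batch so requests sharing a prefix run back-to-back:

```python
def order_for_cache_hits(requests: list[dict]) -> list[dict]:
    """Group requests that share a system prompt so each prompt's
    5-minute cache window stays hot for the whole run."""
    return sorted(requests, key=lambda r: r["system"])

batch = [
    {"system": "reviewer", "user": "q1"},
    {"system": "summarizer", "user": "q2"},
    {"system": "reviewer", "user": "q3"},
]
print([r["user"] for r in order_for_cache_hits(batch)])  # ['q1', 'q3', 'q2']
```

Python's sort is stable, so requests keep their original order within each prompt group.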
OpenAI implementation (automatic)
OpenAI caches automatically for prompts over 1,024 tokens. No code changes needed — just verify it's working:
from openai import OpenAI
client = OpenAI()
def query_openai(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message}
        ]
    )
    cached = getattr(response.usage, "prompt_tokens_details", None)
    if cached:
        print(f"Cached tokens: {cached.cached_tokens}")  # > 0 = cache hit
    return response.choices[0].message.content
DeepSeek (automatic prefix caching)
DeepSeek provides automatic prefix caching at a 90% discount via its disk-based cache. Keep the system prompt identical across calls and DeepSeek handles the rest:
client = OpenAI(api_key="your-deepseek-key", base_url="https://api.deepseek.com")
def query_deepseek(user_message: str) -> str:
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message}
        ]
    )
    print(f"Cache hit tokens: {getattr(response.usage, 'prompt_cache_hit_tokens', 0)}")
    return response.choices[0].message.content
Real-world savings
Prompt cache savings calculation
Scenario: 10,000 requests/day, a 2,000-token system prompt, 200 average user input tokens, 500 average output tokens.
Without caching (Claude Sonnet): 22M input tokens/day x $3/M = $66/day.
With caching (95% hit rate): the 5% of requests that miss pay full price for all 2,200 tokens; hits pay full price only for their 200 user tokens plus the cached rate of $0.30/M on the 2,000 system tokens. That is 3M full-price tokens ($9.00) + 19M cached tokens ($5.70) = $14.70/day, ignoring Anthropic's 25% cache-write surcharge on misses.
Savings: $51.30/day ≈ $1,539/month (a 78% cut in input cost).
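The daily figures are easy to reproduce. A sketch of the arithmetic — note it bills the 5% of requests that miss the cache at full price for the entire prompt, and ignores Anthropic's ~25% cache-write surcharge:

```python
REQS = 10_000
SYS_TOK, USER_TOK = 2000, 200
PRICE, CACHED = 3.00, 0.30   # $ per 1M input tokens (claude-sonnet-4-6)
HIT_RATE = 0.95

# No cache: every token is billed at the full input price.
no_cache = REQS * (SYS_TOK + USER_TOK) / 1e6 * PRICE

# With cache: misses pay full price for everything; hits pay full price
# only for user tokens, cached price for the system prompt.
hits = REQS * HIT_RATE
misses = REQS - hits
full_tok = misses * (SYS_TOK + USER_TOK) + hits * USER_TOK   # 3M tokens
cached_tok = hits * SYS_TOK                                   # 19M tokens
with_cache = full_tok / 1e6 * PRICE + cached_tok / 1e6 * CACHED

print(f"${no_cache:.2f}/day vs ${with_cache:.2f}/day")  # $66.00/day vs $14.70/day
```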
Part 2: Model Routing in Depth
Why most queries don't need a frontier model
Roughly 70% of typical AI API traffic is simple work — classification, extraction, format conversion, basic Q&A. Sending it to GPT-5 or Claude Opus is like hiring a PhD to sort mail.
Price spread: frontier vs lightweight models
| Model | Input $/1M | Output $/1M | Cached $/1M | Context |
|---|---|---|---|---|
| claude-opus-4-6 (Anthropic) | $5.00 | $25.00 | $0.500 | 1M |
| gpt-5.4 (OpenAI) | $2.50 | $15.00 | $0.250 | 1.1M |
| claude-sonnet-4-6 (Anthropic) | $3.00 | $15.00 | $0.300 | 200K |
| gemini-3.1-pro-preview (Google) | $2.00 | $12.00 | $0.200 | 1.0M |
| gpt-5 (OpenAI) | $1.25 | $10.00 | $0.125 | 272K |
| claude-haiku-4-5-20251001 (Anthropic) | $1.00 | $5.00 | $0.100 | 200K |
| gpt-5-mini (OpenAI) | $0.250 | $2.00 | $0.025 | 272K |
| gpt-5-nano (OpenAI) | $0.050 | $0.400 | $0.0050 | 272K |
| deepseek-chat (DeepSeek) | $0.280 | $0.420 | $0.028 | 131.1K |
| grok-4-1-fast (xAI) | $0.200 | $0.500 | $0.050 | 2M |
Building a model router
This router classifies query complexity and sends each request to the matching tier:
import anthropic
from openai import OpenAI
from dataclasses import dataclass
from enum import Enum
class Tier(Enum):
    NANO = "nano"          # Classification, extraction
    MID = "mid"            # Summarization, Q&A
    FRONTIER = "frontier"  # Reasoning, code gen, analysis

@dataclass
class ModelConfig:
    provider: str
    model: str
    cost_per_1k_input: float
    cost_per_1k_output: float

MODEL_TIERS: dict[Tier, ModelConfig] = {
    Tier.NANO: ModelConfig("openai", "gpt-5-nano", 0.00005, 0.00040),
    Tier.MID: ModelConfig("deepseek", "deepseek-chat", 0.00028, 0.00042),
    Tier.FRONTIER: ModelConfig("anthropic", "claude-sonnet-4-6", 0.003, 0.015),
}

COMPLEXITY_KEYWORDS = {
    "high": ["analyze", "compare", "debug", "refactor", "architect",
             "design", "optimize", "explain why", "trade-off", "reason"],
    "low": ["classify", "extract", "format", "convert", "translate",
            "summarize briefly", "yes or no", "list the", "parse"],
}

def classify_complexity(query: str) -> Tier:
    query_lower = query.lower()
    word_count = len(query.split())
    high = sum(1 for kw in COMPLEXITY_KEYWORDS["high"] if kw in query_lower)
    low = sum(1 for kw in COMPLEXITY_KEYWORDS["low"] if kw in query_lower)
    if high >= 2 or (word_count > 200 and high >= 1):
        return Tier.FRONTIER
    if low >= 1 and word_count < 50:
        return Tier.NANO
    return Tier.MID

# Provider clients
clients = {
    "anthropic": anthropic.Anthropic(),
    "openai": OpenAI(),
    "deepseek": OpenAI(api_key="deepseek-key", base_url="https://api.deepseek.com"),
}

def route_and_query(query: str, system_prompt: str = "") -> dict:
    tier = classify_complexity(query)
    config = MODEL_TIERS[tier]
    if config.provider == "anthropic":
        resp = clients["anthropic"].messages.create(
            model=config.model, max_tokens=1024,
            system=[{"type": "text", "text": system_prompt,
                     "cache_control": {"type": "ephemeral"}}] if system_prompt else [],
            messages=[{"role": "user", "content": query}]
        )
        text, inp, out = resp.content[0].text, resp.usage.input_tokens, resp.usage.output_tokens
    else:
        resp = clients[config.provider].chat.completions.create(
            model=config.model,
            messages=[*([{"role": "system", "content": system_prompt}] if system_prompt else []),
                      {"role": "user", "content": query}]
        )
        text, inp, out = resp.choices[0].message.content, resp.usage.prompt_tokens, resp.usage.completion_tokens
    cost = inp / 1000 * config.cost_per_1k_input + out / 1000 * config.cost_per_1k_output
    return {"tier": tier.value, "model": config.model, "response": text, "cost": cost}
# Simple extraction → nano ($0.00005/1K input tokens)
result = route_and_query("Extract all email addresses from this text: ...")
# Routed to: nano (gpt-5-nano) — Cost: $0.000024
# Complex reasoning → frontier
result = route_and_query("Analyze the trade-offs between microservices and monolith and design an architecture...")
# Routed to: frontier (claude-sonnet-4-6) — Cost: $0.018500
The Cost-per-Success framework
Raw per-token price is misleading. A cheap model that fails 40% of the time costs more in practice than an expensive model that succeeds every time. Use cost per success (CPS):
CPS = total_cost / successful_outputs
from dataclasses import dataclass, field
@dataclass
class CostPerSuccessTracker:
    results: dict = field(default_factory=lambda: {
        "nano": {"cost": 0.0, "success": 0, "total": 0},
        "mid": {"cost": 0.0, "success": 0, "total": 0},
        "frontier": {"cost": 0.0, "success": 0, "total": 0},
    })

    def record(self, tier: str, cost: float, success: bool):
        self.results[tier]["cost"] += cost
        self.results[tier]["total"] += 1
        if success:
            self.results[tier]["success"] += 1

    def cps(self, tier: str) -> float:
        r = self.results[tier]
        return r["cost"] / r["success"] if r["success"] > 0 else float("inf")

    def report(self):
        for tier, r in self.results.items():
            rate = r["success"] / r["total"] * 100 if r["total"] else 0
            print(f"{tier:<10} {r['total']:>5} reqs | {rate:.0f}% success | CPS: ${self.cps(tier):.6f}")
Results after running 1,000 mixed queries:
| Tier | Queries | Success rate | Total cost | CPS |
|---|---|---|---|---|
| Nano | 450 | 94% | $0.018 | $0.000043 |
| Mid | 380 | 97% | $0.095 | $0.000258 |
| Frontier | 170 | 99% | $2.856 | $0.016970 |
| All frontier (no routing) | 1,000 | 99% | $16.80 | $0.016970 |
Routing savings
Total with routing: $2.97. All frontier: $16.80. Savings: 82%. The nano tier's CPS is roughly 400x lower than the frontier tier's — for simple tasks, the cheap model is efficient enough.
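That ratio can be recomputed from the table's raw columns; working from total cost and success counts (rather than the rounded CPS column) puts it near 399x:

```python
# Cost per success = total cost / successful outputs, per tier.
nano_cps = 0.018 / (450 * 0.94)        # nano tier from the table
frontier_cps = 2.856 / (170 * 0.99)    # frontier tier, same formula

print(f"{frontier_cps / nano_cps:.0f}x")  # 399x
```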
(Chart: model routing — cost per 1K requests, same workload routed vs all-frontier. Cheapest tier, gpt-5-nano, saves $295.65/mo vs claude-opus-4-6.)
Part 3: Stacking Both Techniques
Routing alone saves 70%. Caching alone saves 75%. Combined, the effects compound:
| Optimization | Monthly cost | Savings |
|---|---|---|
| Baseline (all frontier, no cache) | $5,040 | — |
| + prompt caching only | $1,260 | 75% |
| + model routing only | $1,512 | 70% |
| + both combined | $504 | 90% |
$4,536/month saved — 10,000 requests/day with caching + routing.
The implementation is straightforward: use the router from Part 2 and add cache_control to every Anthropic call (route_and_query already does this). OpenAI and DeepSeek handle caching automatically.
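A quick consistency check on the table: caching strips about 67% off the already-routed bill — somewhat less than the 75% it takes off an all-frontier bill, because routed traffic spends proportionally less on cacheable input tokens — and that compounds to the 90% total:

```python
baseline = 5040.0                      # all frontier, no cache ($/mo)
route_only = baseline * (1 - 0.70)     # $1,512 after routing alone
combined = 504.0                       # routing + caching, from the table

print(f"cache cut on routed traffic: {1 - combined / route_only:.0%}")  # 67%
print(f"total cut vs baseline:       {1 - combined / baseline:.0%}")    # 90%
```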
Part 4: Batch APIs for Offline Work
Not everything needs a real-time response. Batch APIs offer a 50% discount for asynchronous processing:
from openai import OpenAI
import json
client = OpenAI()
def submit_batch(queries: list[str], system_prompt: str) -> str:
    # Build JSONL batch file
    requests = [
        {"custom_id": f"req-{i}", "method": "POST", "url": "/v1/chat/completions",
         "body": {"model": "gpt-5-mini", "max_tokens": 512,
                  "messages": [{"role": "system", "content": system_prompt},
                               {"role": "user", "content": q}]}}
        for i, q in enumerate(queries)
    ]
    with open("/tmp/batch.jsonl", "w") as f:
        for r in requests:
            f.write(json.dumps(r) + "\n")
    batch_file = client.files.create(file=open("/tmp/batch.jsonl", "rb"), purpose="batch")
    job = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h"
    )
    print(f"Batch {job.id} submitted — 50% cheaper, results within 24h")
    return job.id

def get_results(batch_id: str) -> list[dict] | None:
    batch = client.batches.retrieve(batch_id)
    if batch.status == "completed":
        content = client.files.content(batch.output_file_id)
        return [json.loads(line) for line in content.text.strip().split("\n")]
    print(f"Status: {batch.status}")
    return None
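A small polling helper rounds this out — a generic sketch that takes any fetch callable (such as a lambda wrapping the get_results function above) and retries until it returns a result:

```python
import time

def wait_for(fetch, interval_s: float = 60, timeout_s: float = 24 * 3600):
    """Call `fetch` until it returns a non-None result or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = fetch()
        if result is not None:
            return result
        time.sleep(interval_s)
    raise TimeoutError("batch did not complete within the window")

# Demo with a stub that succeeds on the third poll:
attempts = iter([None, None, ["done"]])
print(wait_for(lambda: next(attempts), interval_s=0))  # ['done']
```

In production you would call wait_for(lambda: get_results(job_id)) with the default 60-second interval.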
When to use a Batch API
Batches suit bulk content generation, dataset labeling, nightly reports, and embedding generation — any workload that can wait up to 24 hours. The 50% discount stacks with routing for even deeper savings.
Part 5: Cost Tracking — Prove Your Savings
import json
from datetime import datetime
from collections import defaultdict
class CostTracker:
    def __init__(self):
        self.records = defaultdict(lambda: {
            "requests": 0, "cost": 0.0, "cache_savings": 0.0, "routing_savings": 0.0
        })

    def record(self, model: str, cost: float, cache_savings: float = 0, routing_savings: float = 0):
        key = f"{datetime.now():%Y-%m-%d}:{model}"
        self.records[key]["requests"] += 1
        self.records[key]["cost"] += cost
        self.records[key]["cache_savings"] += cache_savings
        self.records[key]["routing_savings"] += routing_savings

    def summary(self) -> dict:
        total_cost = sum(v["cost"] for v in self.records.values())
        saved_cache = sum(v["cache_savings"] for v in self.records.values())
        saved_route = sum(v["routing_savings"] for v in self.records.values())
        reqs = sum(v["requests"] for v in self.records.values())
        baseline = total_cost + saved_cache + saved_route
        return {
            "total_cost": round(total_cost, 2),
            "total_savings": round(saved_cache + saved_route, 2),
            "effective_discount": f"{(saved_cache + saved_route) / max(baseline, 0.01) * 100:.1f}%",
            "total_requests": reqs,
            "avg_cost_per_request": round(total_cost / max(reqs, 1), 6),
        }
tracker = CostTracker()
# After a day of traffic:
# {"total_cost": 15.42, "total_savings": 128.76, "effective_discount": "89.3%", ...}
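The effective_discount figure in that sample output lines up with the formula in summary():

```python
total_cost, total_savings = 15.42, 128.76    # from the sample day above
baseline = total_cost + total_savings        # what an unoptimized day would cost

print(f"{total_savings / baseline:.1%}")  # 89.3%
```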
Cheat sheet
| Step | Action | Expected savings |
|---|---|---|
| 1 | Add cache_control to Anthropic system prompts | 50-90% on input tokens |
| 2 | Verify OpenAI automatic caching (cached_tokens in responses) | 50% on input tokens |
| 3 | Build a 3-tier model router | 60-70% of total spend |
| 4 | Move batch workloads to a Batch API | 50% on batch jobs |
| 5 | Add cost tracking to prove ROI | Visibility |
Sources
- Anthropic — Prompt Caching docs — 90% discount, 5-min TTL
- OpenAI — Prompt Caching guide — Automatic, 50% discount
- DeepSeek — KV Cache docs — Automatic prefix caching, 90% discount
- Google — Context Caching for Gemini — Configurable TTL, 90% discount
- OpenAI — Batch API reference — 50% discount, 24h window
- Anthropic — Message Batches API — 50% discount, 24h window
- TokenTab — Live Model Pricing — Real-time pricing for 1,800+ models