← 返回博客
cost-optimizationprompt-cachingmodel-routingpythonapi-costsllm-ops

Prompt Cache + 模型路由:如何把 AI API 账单砍掉 90%(附代码)

两个可以叠加使用的技巧,大幅降低你的 AI API 成本。完整的 Python 实现:prompt cache(Anthropic、OpenAI、DeepSeek)和智能模型路由,附 Cost-per-Success 框架。

|13 min read|By TokenTab

Prompt Cache + 模型路由:如何把 AI API 账单砍掉 90%

大多数团队在 AI 上多花了 5-10 倍的钱。不是因为选错了模型——而是因为所有请求都用同一个贵模型,而且每次请求都重复发送同样的上下文。

Prompt cache 消除了重复的输入成本。模型路由把简单查询发给便宜模型。两者叠加,你的账单就能下降 80-90%。本指南包含两种技术的完整可运行代码。

90%

最大节省

Prompt cache + 模型路由 + batch API 的组合效果

技术原理典型节省
Prompt cache复用缓存的 system prompt,避免重复处理 token输入 token 省 50-90%
模型路由简单查询发给便宜模型,难题发给前沿模型总支出省 60-70%

第一部分:Prompt Cache 深入解析#

工作原理

没有 cache 时,每次 API 调用都要重新处理完整的 system prompt。2,000 token 的 prompt 乘以每天 10,000 次请求 = 每天从头处理 2000 万输入 token。

有了 cache,厂商会存储已处理的 prompt。后续请求命中缓存,只需支付一小部分费用:

Without caching:
  Request 1: [System: 2000 tok] + [User: 200 tok] → 2200 input tokens billed
  Request 2: [System: 2000 tok] + [User: 150 tok] → 2150 input tokens billed
  Total: 4,350 tokens at full price

With caching:
  Request 1: [System: 2000 tok → WRITE CACHE] + [User: 200 tok] → 2200 at full price
  Request 2: [System: CACHE HIT] + [User: 150 tok] → 150 full-price + 2000 cached (90% off)
  Total: 2,350 full-price + 2,000 cached tokens

各厂商对比

前沿模型定价(cache 前)

ModelInput $/1MOutput $/1MCached $/1MContext
gpt-5.4OpenAI$2.50$15.00$0.2501.1M
gpt-5OpenAI$1.25$10.00$0.125272K
claude-opus-4-6Anthropic$5.00$25.00$0.5001M
claude-sonnet-4-6Anthropic$3.00$15.00$0.300200K
gemini-3.1-pro-previewGoogle$2.00$12.00$0.2001.0M
gemini-2.5-pro-preview-05-06Google$1.25$10.00$0.1251.0M
deepseek-chatDeepSeek$0.280$0.420$0.028131.1K

Live pricing from TokenTab database. Prices may change — last synced from provider APIs.

厂商cache 折扣TTL激活方式
Anthropic输入价 9 折5 分钟(临时)手动 — cache_control 参数
OpenAI输入价 5 折自动自动 — 无需改代码
Google输入价 9 折可配置手动 — cached_content API
DeepSeek输入价 9 折自动自动 — 前缀匹配

Anthropic 实现

Anthropic 提供最大的折扣(90%),但需要显式标记缓存。5 分钟 TTL 每次命中都会重置——非常适合高流量应用。

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a senior code reviewer for a Python codebase.
Review code for: security vulnerabilities, performance issues,
readability problems, and adherence to PEP 8.
Always provide specific line references and suggested fixes.
Rate severity as: critical, warning, or info.
... (imagine 1500+ tokens of detailed instructions here)
"""

def review_code(code_snippet: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # enables caching
        }],
        messages=[
            {"role": "user", "content": f"Review this code:\n```python\n{code_snippet}\n```"}
        ]
    )
    usage = response.usage
    print(f"Input: {usage.input_tokens} | Cache read: {usage.cache_read_input_tokens} | Cache write: {usage.cache_creation_input_tokens}")
    return response.content[0].text

# First call: cache write
result = review_code("def add(a, b): return a + b")
# Input: 1700 | Cache read: 0 | Cache write: 1500

# Second call within 5 min: cache hit — 90% cheaper on cached tokens
result = review_code("def multiply(x, y): return x * y")
# Input: 200 | Cache read: 1500 | Cache write: 0
💡

Anthropic Cache TTL 重置机制

每次缓存命中都会重置 5 分钟 TTL。如果你的应用每 5 分钟至少处理 1 个请求,缓存就会一直保持热状态。对于 batch 处理,把请求排序以在 TTL 窗口内最大化缓存命中率。

OpenAI 实现(自动)

OpenAI 对超过 1,024 token 的 prompt 自动启用缓存。无需改代码——只需验证:

from openai import OpenAI
client = OpenAI()

def query_openai(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message}
        ]
    )
    cached = getattr(response.usage, "prompt_tokens_details", None)
    if cached:
        print(f"Cached tokens: {cached.cached_tokens}")  # > 0 = cache hit
    return response.choices[0].message.content

DeepSeek(自动前缀缓存)

DeepSeek 通过其基于磁盘的系统提供 90% 折扣的自动前缀缓存。保持 system prompt 一致即可——DeepSeek 自动处理其余部分:

client = OpenAI(api_key="your-deepseek-key", base_url="https://api.deepseek.com")

def query_deepseek(user_message: str) -> str:
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message}
        ]
    )
    print(f"Cache hit tokens: {getattr(response.usage, 'prompt_cache_hit_tokens', 0)}")
    return response.choices[0].message.content

实际节省计算

💰

Prompt Cache 节省计算

场景:每天 10,000 次请求,2,000 token 的 system prompt,平均 200 token 用户输入,平均 500 token 输出。

无缓存(Claude Sonnet):每天 2200 万输入 token x $3/M = $66/天。

有缓存(95% 命中率):200 万全价 + 1900 万缓存价 $0.30/M = $11.70/天。

节省:$54.30/天 = $1,629/月(输入成本减少 82%)。

Standard pricing

claude-sonnet-4-6

$4050.00/mo

40%

saved

With caching

claude-sonnet-4-6

$2430.00/mo

Save $1620.00/mo ($19440.00/yr) with prompt caching


第二部分:模型路由深入解析#

为什么大多数查询不需要前沿模型

典型 AI API 流量中 70% 是简单任务 —— 分类、提取、格式转换、基础问答。把这些发给 GPT-5 或 Claude Opus,就像请一个博士来分拣邮件。

70%

的 API 流量

可以由更小、更便宜的模型处理

价格分布:前沿 vs 轻量模型

ModelInput $/1MOutput $/1MCached $/1MContext
claude-opus-4-6Anthropic$5.00$25.00$0.5001M
gpt-5.4OpenAI$2.50$15.00$0.2501.1M
claude-sonnet-4-6Anthropic$3.00$15.00$0.300200K
gemini-3.1-pro-previewGoogle$2.00$12.00$0.2001.0M
gpt-5OpenAI$1.25$10.00$0.125272K
claude-haiku-4-5-20251001Anthropic$1.00$5.00$0.100200K
gpt-5-miniOpenAI$0.250$2.00$0.025272K
gpt-5-nanoOpenAI$0.050$0.400$0.0050272K
deepseek-chatDeepSeek$0.280$0.420$0.028131.1K
grok-4-1-fastxAI$0.200$0.500$0.0502M

Live pricing from TokenTab database. Prices may change — last synced from provider APIs.

构建模型路由器

这个路由器对查询复杂度进行分类,并将每个请求发送到对应的层级:

import anthropic
from openai import OpenAI
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    NANO = "nano"        # Classification, extraction
    MID = "mid"          # Summarization, Q&A
    FRONTIER = "frontier" # Reasoning, code gen, analysis

@dataclass
class ModelConfig:
    provider: str
    model: str
    cost_per_1k_input: float
    cost_per_1k_output: float

MODEL_TIERS: dict[Tier, ModelConfig] = {
    Tier.NANO: ModelConfig("openai", "gpt-5-nano", 0.00010, 0.00040),
    Tier.MID: ModelConfig("deepseek", "deepseek-chat", 0.00014, 0.00028),
    Tier.FRONTIER: ModelConfig("anthropic", "claude-sonnet-4-6", 0.003, 0.015),
}

COMPLEXITY_KEYWORDS = {
    "high": ["analyze", "compare", "debug", "refactor", "architect",
             "design", "optimize", "explain why", "trade-off", "reason"],
    "low": ["classify", "extract", "format", "convert", "translate",
            "summarize briefly", "yes or no", "list the", "parse"],
}

def classify_complexity(query: str) -> Tier:
    query_lower = query.lower()
    word_count = len(query.split())
    high = sum(1 for kw in COMPLEXITY_KEYWORDS["high"] if kw in query_lower)
    low = sum(1 for kw in COMPLEXITY_KEYWORDS["low"] if kw in query_lower)

    if high >= 2 or (word_count > 200 and high >= 1):
        return Tier.FRONTIER
    if low >= 1 and word_count < 50:
        return Tier.NANO
    return Tier.MID

# Provider clients
clients = {
    "anthropic": anthropic.Anthropic(),
    "openai": OpenAI(),
    "deepseek": OpenAI(api_key="deepseek-key", base_url="https://api.deepseek.com"),
}

def route_and_query(query: str, system_prompt: str = "") -> dict:
    tier = classify_complexity(query)
    config = MODEL_TIERS[tier]

    if config.provider == "anthropic":
        resp = clients["anthropic"].messages.create(
            model=config.model, max_tokens=1024,
            system=[{"type": "text", "text": system_prompt,
                     "cache_control": {"type": "ephemeral"}}] if system_prompt else [],
            messages=[{"role": "user", "content": query}]
        )
        text, inp, out = resp.content[0].text, resp.usage.input_tokens, resp.usage.output_tokens
    else:
        resp = clients[config.provider].chat.completions.create(
            model=config.model,
            messages=[*([{"role": "system", "content": system_prompt}] if system_prompt else []),
                      {"role": "user", "content": query}]
        )
        text, inp, out = resp.choices[0].message.content, resp.usage.prompt_tokens, resp.usage.completion_tokens

    cost = inp / 1000 * config.cost_per_1k_input + out / 1000 * config.cost_per_1k_output
    return {"tier": tier.value, "model": config.model, "response": text, "cost": cost}
# Simple extraction → nano ($0.0001/1K tokens)
result = route_and_query("Extract all email addresses from this text: ...")
# Routed to: nano (gpt-5-nano) — Cost: $0.000024

# Complex reasoning → frontier
result = route_and_query("Analyze the trade-offs between microservices and monolith and design an architecture...")
# Routed to: frontier (claude-sonnet-4-6) — Cost: $0.018500

Cost-per-Success 框架

只看每 token 单价是有误导性的。一个便宜但 40% 概率失败的模型,实际比一个贵但每次都成功的模型更费钱。使用 Cost-per-Success (CPS)

CPS = total_cost / successful_outputs
from dataclasses import dataclass, field

@dataclass
class CostPerSuccessTracker:
    results: dict = field(default_factory=lambda: {
        "nano": {"cost": 0.0, "success": 0, "total": 0},
        "mid": {"cost": 0.0, "success": 0, "total": 0},
        "frontier": {"cost": 0.0, "success": 0, "total": 0},
    })

    def record(self, tier: str, cost: float, success: bool):
        self.results[tier]["cost"] += cost
        self.results[tier]["total"] += 1
        if success:
            self.results[tier]["success"] += 1

    def cps(self, tier: str) -> float:
        r = self.results[tier]
        return r["cost"] / r["success"] if r["success"] > 0 else float("inf")

    def report(self):
        for tier, r in self.results.items():
            rate = r["success"] / r["total"] * 100 if r["total"] else 0
            print(f"{tier:<10} {r['total']:>5} reqs | {rate:.0f}% success | CPS: ${self.cps(tier):.6f}")

运行 1,000 个混合查询后的结果:

层级查询数成功率总成本CPS
Nano45094%$0.018$0.000043
Mid38097%$0.095$0.000258
Frontier17099%$2.856$0.016941
全部用前沿模型(无路由)1,00099%$16.80$0.016970
💰

路由节省

路由后总计:$2.97。全部用前沿模型:$16.80。节省 82%。Nano 层级的 CPS 比前沿模型便宜 394 倍——对于简单任务,便宜模型足够高效。

模型路由:每 1K 次请求的成本

同一工作负载,路由 vs 全部用前沿模型。

500 input tokens300 output tokens1,000 req/day (30,000/mo)
gpt-5-nano
$4.35
grok-4-1-fast
$7.50
deepseek-chat
$7.98
gpt-5-mini
$21.75
gemini-3.1-pro-preview
$138.00
gpt-5.4
$172.50
claude-sonnet-4-6
$180.00
claude-opus-4-6
$300.00

Cheapest: gpt-5-nano saves $295.65/mo vs claude-opus-4-6

Open in Calculator →

第三部分:两种技术叠加#

单独使用路由省 70%。单独使用缓存省 75%。两者组合效果叠加:

优化方式月成本节省
基线(全部前沿模型,无缓存)$5,040
+ 仅 prompt cache$1,26075%
+ 仅模型路由$1,51270%
+ 两者组合$50490%

$4,536/月

每月节省

每天 1 万次请求使用缓存 + 路由

实现很简单——使用第二部分的路由器,并在每个 Anthropic 调用中添加 cache_controlroute_and_query 中已经展示了)。OpenAI 和 DeepSeek 自动处理缓存。


第四部分:Batch API 处理离线任务#

不是所有事情都需要实时响应。Batch API 提供 50% 折扣用于异步处理:

from openai import OpenAI
import json

client = OpenAI()

def submit_batch(queries: list[str], system_prompt: str) -> str:
    # Build JSONL batch file
    requests = [
        {"custom_id": f"req-{i}", "method": "POST", "url": "/v1/chat/completions",
         "body": {"model": "gpt-5-mini", "max_tokens": 512,
                  "messages": [{"role": "system", "content": system_prompt},
                               {"role": "user", "content": q}]}}
        for i, q in enumerate(queries)
    ]
    with open("/tmp/batch.jsonl", "w") as f:
        for r in requests:
            f.write(json.dumps(r) + "\n")

    batch_file = client.files.create(file=open("/tmp/batch.jsonl", "rb"), purpose="batch")
    job = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h"
    )
    print(f"Batch {job.id} submitted — 50% cheaper, results within 24h")
    return job.id

def get_results(batch_id: str) -> list[dict] | None:
    batch = client.batches.retrieve(batch_id)
    if batch.status == "completed":
        content = client.files.content(batch.output_file_id)
        return [json.loads(line) for line in content.text.strip().split("\n")]
    print(f"Status: {batch.status}")
    return None
💡

什么时候用 Batch API

Batch 适合批量内容生成、数据集标注、夜间报告和 embedding 生成——任何可以等待最多 24 小时的工作负载。50% 折扣加上路由,可以实现更深层的节省。


第五部分:成本追踪——证明你的节省效果#

import json
from datetime import datetime
from collections import defaultdict

class CostTracker:
    def __init__(self):
        self.records = defaultdict(lambda: {
            "requests": 0, "cost": 0.0, "cache_savings": 0.0, "routing_savings": 0.0
        })

    def record(self, model: str, cost: float, cache_savings: float = 0, routing_savings: float = 0):
        key = f"{datetime.now():%Y-%m-%d}:{model}"
        self.records[key]["requests"] += 1
        self.records[key]["cost"] += cost
        self.records[key]["cache_savings"] += cache_savings
        self.records[key]["routing_savings"] += routing_savings

    def summary(self) -> dict:
        total_cost = sum(v["cost"] for v in self.records.values())
        saved_cache = sum(v["cache_savings"] for v in self.records.values())
        saved_route = sum(v["routing_savings"] for v in self.records.values())
        reqs = sum(v["requests"] for v in self.records.values())
        baseline = total_cost + saved_cache + saved_route
        return {
            "total_cost": round(total_cost, 2),
            "total_savings": round(saved_cache + saved_route, 2),
            "effective_discount": f"{(saved_cache + saved_route) / max(baseline, 0.01) * 100:.1f}%",
            "total_requests": reqs,
            "avg_cost_per_request": round(total_cost / max(reqs, 1), 6),
        }

tracker = CostTracker()
# After a day of traffic:
# {"total_cost": 15.42, "total_savings": 128.76, "effective_discount": "89.3%", ...}

速查表#

步骤操作预期节省
1给 Anthropic system prompt 添加 cache_control输入 token 省 50-90%
2验证 OpenAI 自动缓存(响应中的 cached_tokens输入 token 省 50%
3构建 3 层模型路由器总支出省 60-70%
4将 batch 工作负载迁移到 Batch APIbatch 任务省 50%
5添加成本追踪以证明 ROI可见性
用 TokenTab 计算你的节省

参考来源#

  1. Anthropic — Prompt Caching docs — 90% discount, 5-min TTL
  2. OpenAI — Prompt Caching guide — Automatic, 50% discount
  3. DeepSeek — KV Cache docs — Automatic prefix caching, 90% discount
  4. Google — Context Caching for Gemini — Configurable TTL, 90% discount
  5. OpenAI — Batch API reference — 50% discount, 24h window
  6. Anthropic — Message Batches API — 50% discount, 24h window
  7. TokenTab — Live Model Pricing — Real-time pricing for 1,800+ models

每周大模型价格速递

AI Model 调价时第一时间通知你。免费、不发垃圾邮件、随时退订。