Claude Code Context Management

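Opening: what are we actually talking about

If you have built anything on the Claude API or the OpenAI API, you have probably hit the same wall: once the context grows past a certain point, either the request is rejected outright with prompt-too-long, or cache misses blow up your bill, or the model starts "forgetting" things.

The usual answer is a hand-rolled compact layer: drop a chunk of history at a threshold, or splice in a summary. Claude Code does something different. It is not one mechanism but a five-stage pipeline. Each stage covers a different scenario, carries a different cost, and acts as a fallback at a different moment, and every stage ultimately lands on the messages array that is sent to the API.

This article takes the pipeline apart and covers, for each stage:

What it is
When it triggers
What happens once it triggers
How the messages array changes before and after
A concrete scenario

First, a shared premise: the messages array is the only source of truth

For the Anthropic API, all "context management" comes down to one client-side decision:

Which set of messages do I send this turn?

The API knows nothing about "your session history" or "how many times you have compacted". It only sees the messages array in the current request payload. Each message is either user or assistant, and its content is either a string or a mix of text / tool_use / tool_result / thinking blocks.

Claude Code's context management is, put simply, a series of client-side passes that modify this array before the result is sent to the API. In a few cases the request also carries server-side Context Management directives (such as cache_edits / clear_tool_uses_20250919), but those are a bonus, not the backbone.

Keep this in mind: every stage described below restructures the messages array.

Overview: the five-stage pipeline

Before every API request, Claude Code runs the stages in this order: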

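TEXT

Tool Result Budget → Snip → Microcompact → Context Collapse → Autocompact
                                                                  │
                                            ┌─────────────────────┤
                                            ▼                     ▼
                                 Session Memory Compact    Traditional LLM Compact
                                    (preferred path)          (fallback path)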

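The design intuition:

The earliest mechanisms are the cheapest and most fine-grained; they only intercept or replace small pieces of content at the entry point
The later the stage, the higher the cost and the coarser the effect; only the last step actually calls the model to produce a summary
Each stage takes over only when the previous one has not solved the problem, so the pipeline never starts with the heavy work

Let's take them one by one.

Stage 1: Tool Result Budget

What it is

This is rate limiting at the entry point, not compression of history.

When a tool (Bash, Read, Grep, and so on) finishes, its tool_result passes a budget check before it goes into the messages array: if it is too large, the raw text is not allowed in.

When it triggers

At two points:

Per result: right after each tool finishes, when its result is about to be wrapped into a tool_result block
In aggregate: before every API request, a total-budget check across all tool_result blocks in a single user message (mainly to guard against parallel tools returning in the same turn and stacking up)

What happens once it triggers

Two thresholds (defaults):

50K characters per result: anything larger is persisted to disk at tool-results/<tool_use_id>.txt, and only a "reference message" stays in messages
200K characters across all tool_result blocks in one message: when exceeded, the largest result gets the same replacement, repeated until the message is back within budget

The replacement content follows a fixed template: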

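TEXT

<persisted-output>
Output too large (317842 chars). Full output saved to: /project/.claude/tool-results/toolu_abc.txt

Preview (first 2 KB):
...first 2 KB of the original output...
</persisted-output>

messages array before and after

JSONC

// Before: the raw tool_result
{
  "type": "user",
  "message": {
    "content": [
      {
        "type": "tool_result",
        "tool_use_id": "toolu_abc",
        "content": "<300 KB of bash stdout>"
      }
    ]
  }
}

// After: preview + on-disk path
{
  "type": "user",
  "message": {
    "content": [
      {
        "type": "tool_result",
        "tool_use_id": "toolu_abc",
        "content": "<persisted-output>Output too large (317842 chars). Full output saved to: /project/.claude/tool-results/toolu_abc.txt\n\nPreview (first 2 KB):\n...\n</persisted-output>"
      }
    ]
  }
}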

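Scenario

You ask Claude to run cat huge.log and stdout is 500 KB. If those 500 KB went into the context verbatim:

This single API call would eat most of the window
Every later turn would resend it (no client-side cache can fix that)
The model cannot really digest a log that long anyway

Tool Result Budget lets only a 2 KB preview into the context. The model sees that the file exists, and when it really needs details it reads the relevant slice back with Read(offset, limit). A lightweight disk layer replaces a heavy token layer.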
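The two-level budget described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Claude Code's actual implementation: ToolResult, persist, apply_budget, the in-memory dict standing in for the disk, and the exact preview length are all hypothetical.

```python
from dataclasses import dataclass

PER_RESULT_LIMIT = 50_000    # characters per tool_result (default quoted above)
PER_MESSAGE_LIMIT = 200_000  # characters across all tool_results in one message
PREVIEW_CHARS = 2_048        # "first 2 KB"; the exact cut-off is an assumption


@dataclass
class ToolResult:
    tool_use_id: str
    content: str


def persist(result: ToolResult, store: dict) -> None:
    """Move the full output into `store` (a stand-in for disk) and leave a reference."""
    path = f"tool-results/{result.tool_use_id}.txt"
    full = result.content
    store[path] = full
    result.content = (
        "<persisted-output>\n"
        f"Output too large ({len(full)} chars). Full output saved to: {path}\n\n"
        "Preview (first 2 KB):\n"
        f"{full[:PREVIEW_CHARS]}\n"
        "</persisted-output>"
    )


def apply_budget(results: list, store: dict) -> None:
    persisted = set()
    # Pass 1: per-result limit.
    for r in results:
        if len(r.content) > PER_RESULT_LIMIT:
            persist(r, store)
            persisted.add(r.tool_use_id)
    # Pass 2: aggregate limit; replace the largest remaining result until within budget.
    while sum(len(r.content) for r in results) > PER_MESSAGE_LIMIT:
        candidates = [r for r in results if r.tool_use_id not in persisted]
        if not candidates:
            break
        biggest = max(candidates, key=lambda r: len(r.content))
        persist(biggest, store)
        persisted.add(biggest.tool_use_id)
```

With one 60K-character result and five 45K-character results in the same message, the first pass persists the 60K one and the second pass persists a single 45K one, which is enough to bring the message back under 200K.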

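Stage 2: Snip (precise trimming)

What it is

An active deletion mechanism that puts the model in charge: every user input gets a short ID, and the model can cite that ID to say "I no longer need this whole turn". Everything from that user input up to the next user input (the user message itself, the assistant's subsequent thinking, and every tool_use / tool_result) is then physically removed from the messages array.

This is the only model-driven stage in the pipeline; in all the others the client decides automatically.

Unit of deletion: a whole user turn

The most important thing to understand about Snip is that it works per user turn, not per message.

A user turn looks like this: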

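TEXT

[ user input ][ assistant/tool_use ][ tool_result ][ assistant/tool_use ][ tool_result ]...
 ↑ [id:xxxxxx] is attached here                                                           ↑ up to the next real user input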

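When the model calls SnipTool on an ID, every message from that user input up to, but not including, the next user input disappears from subsequent API requests.

When it triggers

At three points:

Before every API request: on the copy of messages being sent, every non-meta user message gets an [id:<short ID>] suffix (user messages that carry tool_result are not "real user input" and get no ID)
Whenever the context has grown by about 10K tokens without a snip: a nudge attachment is injected to remind the model that it can snip
When the model calls SnipTool: the deletion is actually performed

What happens once it triggers

The short ID is derived from the message UUID (first 10 hex digits converted to base36, first 6 characters kept), so every user input has a stable, short ID that the model can easily repeat
The model calls SnipTool with one or more IDs; the client collects every UUID in the corresponding user turns and removes them from the in-memory array
removedUuids is written at the transcript boundary and replayed on resume, which makes the deletion persistent
After deletion, the affected parentUuid links are repaired so that no chain is left dangling

Key detail: the [id:xxxxxx] tag is added only to the copy sent to the API and is never written back to storage. The user's original words in the transcript stay clean; only the model-visible copy carries the tag.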
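As a rough illustration of the ID derivation and the "tag only the outgoing copy" rule, here is a Python sketch. It follows the recipe stated above (first 10 hex digits, base36, first 6 characters); the flat message shape with role / uuid / content / is_meta keys and the function names are assumptions made for the example.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # the 36 base36 digits


def to_base36(n: int) -> str:
    if n == 0:
        return "0"
    digits = []
    while n:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def short_id(message_uuid: str) -> str:
    # First 10 hex digits of the UUID -> integer -> base36 -> first 6 characters.
    hex_prefix = message_uuid.replace("-", "")[:10]
    return to_base36(int(hex_prefix, 16))[:6]


def tag_for_api(messages: list) -> list:
    """Return a tagged copy for the API; the stored messages are left untouched."""
    tagged = []
    for msg in messages:
        content = msg["content"]
        # Real user input is a plain string here; tool_result carriers hold a list.
        if msg["role"] == "user" and isinstance(content, str) and not msg.get("is_meta"):
            msg = {**msg, "content": f"{content}\n[id:{short_id(msg['uuid'])}]"}
        tagged.append(msg)
    return tagged
```

Because tag_for_api builds new dicts only for the messages it tags, the original list can keep being written to the transcript without the suffix.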

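Full example: from a multi-turn conversation to one Snip

Suppose the session has accumulated the history below (some fields omitted for readability). Note that before sending to the API, Claude Code appends an [id:...] tag to the end of every real user input; user messages of the tool_result kind get no tag.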

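Step 0: the original messages array (as sent to the API)

JSONC

[
  // Turn 1: the user asks for a TODO survey
  {
    "type": "user",
    "message": {
      "content": "Find every TODO in the project for me, I want to deal with them in one go\n[id:abc123]"
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Sure, I'll search with Grep." },
        {
          "type": "tool_use",
          "id": "toolu_01",
          "name": "Grep",
          "input": { "pattern": "TODO", "path": "src/" }
        }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_01",
          "content": "src/auth/login.ts:42: // TODO: handle refresh token\nsrc/api/user.ts:89: // TODO: validate input\n... 23 matches in total ..."
        }
      ]
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Let me open login.ts to see the context of that refresh token TODO." },
        {
          "type": "tool_use",
          "id": "toolu_02",
          "name": "Read",
          "input": { "file_path": "src/auth/login.ts" }
        }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_02",
          "content": "<full text of login.ts, about 4 KB>"
        }
      ]
    }
  },

  // Turn 2: the user changes their mind and switches to a bug fix
  {
    "type": "user",
    "message": {
      "content": "Never mind the TODOs for now, the production login 500 bug is more urgent, fix that first\n[id:def456]"
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Understood, TODOs are on hold. For the login 500 I'll check the recent error log." },
        {
          "type": "tool_use",
          "id": "toolu_03",
          "name": "Bash",
          "input": { "command": "tail -200 /var/log/app/error.log" }
        }
      ]
    }
  }
]

At this point the model can see two real user input IDs: abc123 (survey the TODOs) and def456 (fix the login bug). Since the user has explicitly said to drop the TODOs for now, everything from Turn 1 (the list of 23 TODOs, the full text of login.ts, and the associated thinking) is pure token overhead for the bug fix that follows.

Step 1: the model calls SnipTool on its own initiative

JSONC

{
  "type": "tool_use",
  "id": "toolu_04",
  "name": "Snip",
  "input": {
    "ids": ["abc123"],
    "reason": "The user changed direction and dropped the TODO survey; the Grep results and the full text of login.ts from that turn are irrelevant to the current bug fix"
  }
}

Step 2: the messages array sent to the API on the next turn, after Snip has run

All of Turn 1 (the user input, two assistant tool_use messages, and two tool_result messages, 5 messages in total) is physically deleted:

JSONC

[
  // Turn 1 is gone entirely; the array now starts at the user input of Turn 2

  {
    "type": "user",
    "message": {
      "content": "Never mind the TODOs for now, the production login 500 bug is more urgent, fix that first\n[id:def456]"
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Understood, TODOs are on hold. For the login 500 I'll check the recent error log." },
        {
          "type": "tool_use",
          "id": "toolu_03",
          "name": "Bash",
          "input": { "command": "tail -200 /var/log/app/error.log" }
        }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_03",
          "content": "<log tail output>"
        }
      ]
    }
  }
]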

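A few things worth noting

Deletion happens in pairs. The tool_use for toolu_01 disappears together with its tool_result, and the same goes for toolu_02. The API therefore never sees a tool_result without its tool_use, or the reverse.
[id:def456] is untouched. Snip acts precisely on the user turn that owns abc123 and does not spill into later turns.
The on-disk transcript does not change. If you resume the session, Claude Code replays the same deletion from removedUuids so the model-visible view stays consistent, while the user's original words remain on disk and can be audited at any time.
The context saving is substantial. The Grep results plus the full text of login.ts in Turn 1 come to roughly 6-8K tokens, removed by a single Snip, and everything that remains is still original text rather than a summary.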
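The turn-level deletion in this example can be sketched as a single pass over the array. This is a simplified Python illustration: the flat message dicts, the precomputed short_id field, and the snip_turns name are assumptions, and the parentUuid repair mentioned earlier is left out.

```python
def is_real_user_input(msg: dict) -> bool:
    # Assumption: real user input has string content; tool_result carriers hold a list.
    return msg["role"] == "user" and isinstance(msg["content"], str)


def snip_turns(messages: list, ids_to_snip: set) -> tuple:
    """Drop every user turn whose short ID is in ids_to_snip.

    Returns (kept_messages, removed_uuids). A turn runs from a real user
    input up to, but not including, the next real user input.
    """
    kept, removed_uuids = [], []
    dropping = False
    for msg in messages:
        if is_real_user_input(msg):
            # A new turn starts here; decide once for the whole turn.
            dropping = msg["short_id"] in ids_to_snip
        if dropping:
            removed_uuids.append(msg["uuid"])
        else:
            kept.append(msg)
    return kept, removed_uuids
```

Deciding once per turn is what keeps every tool_use together with its tool_result: either both are dropped or both are kept.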

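Why it is designed this way

The weakness of traditional compact is the one-size-fits-all summary: the granularity is coarse, and useful original text easily gets thrown out with the rest. Snip is the opposite: the model itself judges which turn is obsolete and cuts out exactly that turn, leaving the recent messages intact. The two complement each other:

When you have pivoted once and just want to drop a detour, use Snip
When the whole context is overloaded and there is no clear "this turn is obsolete" boundary, use Microcompact / Autocompact further down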

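Stage 3: Microcompact (lightweight rewrite)

What it is

Lightweight compression that targets old tool results only. It does not summarize the conversation, call the model, or touch user messages. It does one thing: replace large, old tool_result.content with a placeholder or a cache edit directive.

Only results from these tools are processed: Read, Bash, Grep, Glob, WebSearch, WebFetch, Edit, Write. User text, model thinking, plans, and attachments are left alone.

When it triggers

Two independent paths:

Path A: Time-based Microcompact

Off by default. When enabled, all of the following must hold:
More than 60 minutes since the last assistant message
Running on the main thread
Checked once before every request

Path B: Cached Microcompact

Feature flag enabled
Model supports cache editing
Running on the main thread
Checked once before every request

What happens once it triggers

Time-based path: modify the local messages directly:

Find every tool_result from a "compactable tool" by tool id
Keep the 5 most recent; replace the content of the rest with the literal string [Old tool result content cleared]
Reset the module state of cached microcompact along the way (so the cache does not reference stale tool ids)

Cached path: leave the local messages unchanged and attach cache_edits at the API layer:

The old tool_result entries in the local array look untouched
The request to Anthropic carries an extra cache_edits directive telling the server "drop segments xxx from your cache"
The benefit is that the prompt cache prefix is preserved as far as possible, avoiding the "one change, full miss" of the time-based path

There is also a layer of API-native Context Management, which is not done by the client but is a strategy the Anthropic API supports natively: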

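JSONC

{ "type": "clear_thinking_20251015", "keep": "all" }
{
  "type": "clear_tool_uses_20250919",
  "trigger": { "type": "input_tokens", "value": 180000 },
  "clear_at_least": { "type": "input_tokens", "value": 140000 }
}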

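Both blocks go into the API parameters. With them, the server automatically clears tool_use content once the request exceeds 180K input tokens.

Full example: Time-based path

To keep the example readable, the scenario below is scaled down: assume keepRecent = 2 (the default is 5). You ask Claude to survey a project, it runs 3 tools in a row, you go to lunch, and 70 minutes later you come back and continue asking questions.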

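Step 0: the messages array before you leave

JSONC

[
  {
    "type": "user",
    "message": { "content": "Help me understand the structure of this project\n[id:abc123]" }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Let me look at the directory layout first." },
        { "type": "tool_use", "id": "toolu_01", "name": "Glob", "input": { "pattern": "**/*.ts" } }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_01",
          "content": "src/index.ts\nsrc/auth/login.ts\nsrc/api/user.ts\n... 147 files in total ..."
        }
      ]
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Let me read the entry file." },
        { "type": "tool_use", "id": "toolu_02", "name": "Read", "input": { "file_path": "src/index.ts" } }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_02",
          "content": "import { createApp } from './app'\nimport { loadConfig } from './config'\n... <full 4 KB of source> ..."
        }
      ]
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Now let me scan for config-related definitions." },
        { "type": "tool_use", "id": "toolu_03", "name": "Grep", "input": { "pattern": "loadConfig", "path": "src/" } }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_03",
          "content": "src/config.ts:12: export function loadConfig() {\nsrc/config.ts:34:   return loadConfig()\n... 8 matches in total ..."
        }
      ]
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "The project is roughly a single entry point plus three modules: config, auth, and api. Which one do you want to dig into first?" }
      ]
    }
    // ← timestamp of the last assistant message: 12:00
  }
]

Step 1: the user comes back 70 minutes later

JSONC

{
  "type": "user",
  "message": { "content": "Let's start with the auth module then\n[id:def456]" }
  // ← current time 13:10, 70 minutes after the last assistant message (12:00)
}

At this moment the trigger conditions for Time-based Microcompact hold: main thread, a previous assistant message exists, and the gap exceeds 60 minutes.

Step 2: Microcompact scans the messages array

Collect every tool_result from a "compactable tool", oldest to newest:

TEXT

toolu_01 (Glob)   toolu_02 (Read)   toolu_03 (Grep)
   ↑ oldest            ↑ second            ↑ newest

Keep the most recent keepRecent = 2 (toolu_02 / toolu_03) and replace the content of the rest with the placeholder.

Step 3: the messages array after the transformation

JSONC

[
  // user_1 and its assistant tool_use are untouched
  {
    "type": "user",
    "message": { "content": "Help me understand the structure of this project\n[id:abc123]" }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Let me look at the directory layout first." },
        { "type": "tool_use", "id": "toolu_01", "name": "Glob", "input": { "pattern": "**/*.ts" } }
      ]
    }
  },

  // ★ tool_result.content of toolu_01 is replaced
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_01",
          "content": "[Old tool result content cleared]"
        }
      ]
    }
  },

  // toolu_02 / toolu_03 keep their original content (inside the keepRecent window)
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Let me read the entry file." },
        { "type": "tool_use", "id": "toolu_02", "name": "Read", "input": { "file_path": "src/index.ts" } }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_02",
          "content": "import { createApp } from './app'\nimport { loadConfig } from './config'\n... <full 4 KB of source> ..."
        }
      ]
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Now let me scan for config-related definitions." },
        { "type": "tool_use", "id": "toolu_03", "name": "Grep", "input": { "pattern": "loadConfig", "path": "src/" } }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_03",
          "content": "src/config.ts:12: export function loadConfig() {\nsrc/config.ts:34:   return loadConfig()\n... 8 matches in total ..."
        }
      ]
    }
  },

  // the closing assistant message and the new user input follow unchanged
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "The project is roughly a single entry point plus three modules: config, auth, and api. Which one do you want to dig into first?" }
      ]
    }
  },
  {
    "type": "user",
    "message": { "content": "Let's start with the auth module then\n[id:def456]" }
  }
]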

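A few things worth noting

The tool_use_id is not deleted. The tool_use for toolu_01 (including its argument pattern: "**/*.ts") is kept in full; only the content of the matching tool_result is replaced with the literal string. The tool_use ↔ tool_result pairing required by the API still holds.
The model can still tell which tools were run. It sees that toolu_01 was a Glob("**/*.ts") call whose result has been discarded. If it needs the result later, it can call Glob again.
User text and assistant thinking are untouched. Microcompact only targets the content of tool_result.
How much is saved. In this example the Glob output of toolu_01 is about 3 KB, while the placeholder is 33 bytes. A real session may well hold 10+ old tool_result blocks of a few KB to tens of KB each, which adds up to a real saving of 5K-50K tokens.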
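The Time-based transformation walked through above can be sketched as follows. This is a Python illustration only: the flat role / content message shape and the helper names are assumptions, while the placeholder string and the tool list are the ones quoted in this section.

```python
import copy

PLACEHOLDER = "[Old tool result content cleared]"
COMPACTABLE_TOOLS = {"Read", "Bash", "Grep", "Glob", "WebSearch", "WebFetch", "Edit", "Write"}


def blocks(message: dict, block_type: str) -> list:
    """Return the content blocks of the given type (empty for string content)."""
    content = message["content"]
    if not isinstance(content, list):
        return []
    return [b for b in content if b.get("type") == block_type]


def time_based_microcompact(messages: list, keep_recent: int = 5) -> list:
    # Map tool_use_id -> tool name, so results can be filtered by tool.
    tool_names = {b["id"]: b["name"] for m in messages for b in blocks(m, "tool_use")}
    # Compactable tool_result ids, oldest to newest.
    compactable_ids = [
        b["tool_use_id"]
        for m in messages
        for b in blocks(m, "tool_result")
        if tool_names.get(b["tool_use_id"]) in COMPACTABLE_TOOLS
    ]
    stale = set(compactable_ids[: max(0, len(compactable_ids) - keep_recent)])
    result = copy.deepcopy(messages)  # leave the caller's array untouched
    for m in result:
        for b in blocks(m, "tool_result"):
            if b["tool_use_id"] in stale:
                b["content"] = PLACEHOLDER  # the tool_use block itself is kept
    return result
```

Only tool_result content is rewritten; tool_use blocks, user text, and assistant text pass through unchanged, so the pairing the API requires is preserved.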

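How the Cached path differs

The key point of the Cached path is that the local messages do not change at all; the replacement happens in the server-side cache. For comparison (same scenario as above, but on the Cached path):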

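JSONC

// Local messages array: all original content kept, no [Old ... cleared] anywhere

// API request body sent to Anthropic (conceptual):
{
  "model": "claude-xxx",
  "messages": [ /* the full array above, content unchanged */ ],
  "cache_edits": [
    // pending edits maintained by the client, telling the server
    // "the segment for toolu_01 in your prompt cache can be dropped"
    { "action": "clear", "tool_use_id": "toolu_01" }
  ]
}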

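The benefit: the prompt cache prefix is not broken. Editing local messages directly, as the Time-based path does, changes the cache key, so cache hits drop to zero on the next request. The Cached path lets the server delete content internally, which cuts token cost while keeping the cache hit rate.

In essence, a hot/cold cache distinction

Stepping back, the split between the two paths is really a cache state machine:

Hot cache (the cache is still valuable) → Cached path, fine-grained editing via cache_edits at the API layer
Cold cache (60 minutes of inactivity, the cache has most likely expired) → Time-based path, give up the cache and trim the local messages directly

In the source the two paths short-circuit: Time-based is checked first and returns immediately on a hit, so the Cached path is never reached. The order follows naturally from hot versus cold: once the cache is judged cold, cache editing is pointless.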
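The short-circuit ordering can be pictured as a small decision function. This Python sketch only restates the conditions listed in this section; the function name, parameter names, and string return values are hypothetical.

```python
from typing import Optional


def choose_microcompact_path(
    minutes_since_last_assistant: Optional[float],
    time_based_enabled: bool,
    cached_enabled: bool,
    model_supports_cache_editing: bool,
    is_main_thread: bool,
    gap_minutes: float = 60,
) -> str:
    """Return "time_based", "cached", or "none" (illustrative labels)."""
    if not is_main_thread:
        return "none"
    # Cold cache: checked first, and a hit returns immediately.
    if (
        time_based_enabled
        and minutes_since_last_assistant is not None
        and minutes_since_last_assistant > gap_minutes
    ):
        return "time_based"
    # Hot cache: only reached when the time-based check did not fire.
    if cached_enabled and model_supports_cache_editing:
        return "cached"
    return "none"
```

Even with both paths enabled, a 70-minute gap selects the time-based path and the cached branch is never evaluated.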

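Scenario summary

During active use: the Cached path quietly trims the server-side cache in the background; nothing changes locally, but requests get cheaper
Coming back after a long pause: the Time-based path clears old tool_result content locally, trading cache hits for context space
Near the token limit: the API-native clear_tool_uses_20250919 is the backstop, with the server clearing automatically at the 180K threshold

Stage 4: Context Collapse

This stage exists in the query pipeline, after Microcompact and before Autocompact. Enabling it suppresses proactive Autocompact: in Claude Code's design, Collapse and Autocompact compete for the same headroom, so with Collapse on, shouldAutoCompact() returns false and Collapse takes over.

The transcript shows that Collapse writes two kinds of records:

marble-origami-commit: an append-only splice instruction that records "how to fold a span of history into a single summary placeholder", with collapseId / summaryUuid / summaryContent / firstArchivedUuid / lastArchivedUuid
marble-origami-snapshot: a last-wins snapshot of staged state, with staged spans / an armed flag / lastSpawnTokens

These two record structures suggest that Collapse does "segmented archiving + summary placeholder": roughly, it scores and selects a span of early history, packs it into an archive unit with a summary, and lets a single placeholder stand in for that span in later requests. The finer points (the selection algorithm for staged spans, the exact format of the summary placeholder, the chain of trigger thresholds) are out of scope for this article.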

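Stage 5: Autocompact (heavy fallback)

What it is

The last line of defense when the previous four stages have failed to bring the context down. It does not compress anything itself; it picks one of two sub-paths:

Preferred: Session Memory Compact (read a continuously maintained summary.md)
Fallback: traditional LLM Compact (call the model once, on the spot, for a 9-section summary)

When it triggers

When shouldAutoCompact() decides that the context is close to the token limit. Note that if Context Collapse is enabled, this step is skipped entirely (Collapse handles it).

Once triggered, the flow is: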

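TEXT

autoCompactIfNeeded()
  ├── try Session Memory Compact first
  │     ├── success → return the result
  │     └── failure (null) → fall back
  └── traditional LLM Compact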

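Sub-path A: Session Memory Compact (structured summary)

What it is

The core idea: do not wait until the context overflows to start summarizing. Maintain a structured summary file in the background all along, and when the context overflows, read that file and use it as the summary.

The benefits are clear:

Zero API cost at compaction time: no ad hoc model call, just a disk read
More stable summary quality: the background process updates the file at intervals, so coverage is more systematic than a summary generated on the spot
Cross-turn information keeps accumulating: past error fixes and knowledge about the project all build up in one file

Where the Session Memory file lives

The file is named summary.md. Full path: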

TEXT

{projectDir}/{sessionId}/session-memory/summary.md

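Note that there is one file per session, not one shared across the project. The reason is simple: different sessions do different things, and sharing one file would cause cross-talk.

The 10-section template

summary.md is not a free-form diary; a background agent fills in a fixed template. An empty template is written at initialization, and each later extraction only updates the body. The full template has 10 sections:

#	Section	What goes in it (guidance)
1	Session Title	An information-dense session title of 5-10 words, no filler words
2	Current State	What is being worked on right now? Unfinished tasks and next steps
3	Task specification	What does the user want built? Any design decisions or explanatory context
4	Files and Functions	Which files matter? What each contains and why it is relevant
5	Workflow	Which bash commands are commonly run, in what order, and how to read their output
6	Errors & Corrections	Errors encountered and how they were fixed, what the user corrected, which approaches did not work
7	Codebase and System Documentation	Important system components and how they work together
8	Learnings	What worked, what did not, what to avoid (without repeating other sections)
9	Key results	If the user explicitly asked for a result (an answer, a table, a document), keep it here verbatim
10	Worklog	What was attempted, step by step, with a very short summary of each step

Keep this distinct from the 9-section summary of traditional LLM Compact described later. That one is the format the Autocompact fallback asks the model to generate on the spot, and its sections differ (for example "All user messages", "Current Work", and "Optional Next Step", which lean toward "conversation context"). The 10-section Session Memory template leans toward "project memory".

When it triggers

There are two layers: when the background process updates summary.md, and when Autocompact reads it.

Background update triggers (default thresholds):

First initialization: the current messages reach 10000 tokens
Incremental update: (token growth ≥ 5000 && tool calls ≥ 3) || (token growth ≥ 5000 && no tool call in the most recent turn)
Runs only when querySource === 'repl_main_thread'; subagents / teammates do not run it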
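The background update conditions above reduce to a small predicate. The Python sketch below uses the default thresholds quoted in this section; the function and parameter names are hypothetical, and how Claude Code actually counts tokens and tool calls is not shown here.

```python
INIT_TOKENS = 10_000         # first initialization threshold
UPDATE_TOKEN_GROWTH = 5_000  # minimum token growth since the last extraction
UPDATE_TOOL_CALLS = 3        # minimum tool calls since the last extraction


def should_extract(
    initialized: bool,
    total_tokens: int,
    tokens_since_last: int,
    tool_calls_since_last: int,
    last_turn_had_tool_calls: bool,
    query_source: str,
) -> bool:
    # Only the main REPL thread maintains summary.md.
    if query_source != "repl_main_thread":
        return False
    if not initialized:
        return total_tokens >= INIT_TOKENS
    # Both update branches require the token growth, so test it once.
    if tokens_since_last < UPDATE_TOKEN_GROWTH:
        return False
    return tool_calls_since_last >= UPDATE_TOOL_CALLS or not last_turn_had_tool_calls
```

Written this way it is easy to see that token growth is a precondition in both branches: enough tool activity, or a turn without tool calls, only decides whether this is a good moment to extract.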

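When Autocompact calls it (the first-try entry of sub-path A):

shouldAutoCompact() decides that compaction is needed
Wait for any in-flight background extraction to finish
Read summary.md; if the file does not exist or is still the empty template, return null and hand over to the fallback
Otherwise perform the compaction

Lifecycle of `lastSummarizedMessageId`

This is one of the core pieces of state in Session Memory Compact. It determines "after which message the retained region begins", and the retention algorithm below makes no sense without it.

Semantics: the uuid of the last message absorbed into summary.md

In other words, every message at or before the position of lastSummarizedMessageId in the array has already been digested by Session Memory; only the messages after it are the increment for the next extraction.

When and how it is updated

After a background extraction finishes, the update is not unconditional; it passes through a safety gate: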

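PYTHON

# Pseudocode
def update_last_summarized_message_id_if_safe(messages):
    # Is the last assistant turn still waiting for a tool_result?
    if has_tool_calls_in_last_assistant_turn(messages):
        return  # skip the update so an unclosed tool pair is not split

    last = messages[-1]
    if last.uuid:
        set_last_summarized_message_id(last.uuid)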

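Why the gate?

Because the compact phase uses lastSummarizedMessageId to compute where the retained region starts. If it were updated at the moment when the assistant has just issued a tool_use and the tool_result has not yet come back, a later compact could put the tool_use into the "already summarized" part and the tool_result into the retained region, and the API request would fail with a 400 along the lines of "tool_result has no matching tool_use". The gate ensures that updates happen at natural breakpoints in the conversation.

Retention-window algorithm in detail

In pseudocode, calculateMessagesToKeepIndex() from the source does the following: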

CodeBlock Loading...

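Key points:

The lower bound is AND, not OR. Both "enough tokens" and "enough text-block messages" must hold. This avoids degenerate cases: for example, 50K tokens of pure tool_result (a pile of large file Reads) satisfies the token bound, but with only 2 messages containing actual text the model would see almost none of the conversational continuity. The text-block bound guarantees that at least 5 messages where someone is actually talking are kept.
The upper bound is a hard stop. Once totalTokens exceeds 40K, the loop breaks and stops expanding backward. This is the capacity cap of the retained region, not a minimum guarantee.
The scan moves the start point backward. The end (the tail of the messages array) never moves; the start does. Each i-- pulls one earlier message into the retained region.
It never crosses the compact boundary floor. If a compact has happened before, backward expansion stops at the boundary of the previous compact.
Finally, the start is aligned with API invariants: if it lands on a tool_result whose matching tool_use sits in an earlier assistant message, that assistant message is pulled in as well. The merge requirements of thinking blocks are handled the same way.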
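To make the key points concrete, here is a runnable Python sketch that reconstructs the window calculation from them. It is an approximation under stated assumptions, not the source's calculateMessagesToKeepIndex(): each message is a dict with precomputed tokens, has_text, and is_tool_result fields, the cap is tested before each backward step, and the thinking-block merge is omitted.

```python
MIN_TOKENS = 10_000
MIN_TEXT_BLOCK_MESSAGES = 5
MAX_TOKENS = 40_000


def calculate_keep_index(messages: list, last_summarized_index: int, floor: int = 0) -> int:
    """Return the index where the retained region starts.

    `floor` is the previous compact boundary; the start never moves below it.
    """
    start = last_summarized_index + 1
    total = sum(m["tokens"] for m in messages[start:])
    texts = sum(1 for m in messages[start:] if m["has_text"])

    i = start - 1
    while i >= floor:
        # Lower bound is an AND: both minimums must be met to stop early.
        if total >= MIN_TOKENS and texts >= MIN_TEXT_BLOCK_MESSAGES:
            break
        # Upper bound is a hard stop (assumption: tested before each step).
        if total >= MAX_TOKENS:
            break
        total += messages[i]["tokens"]
        texts += 1 if messages[i]["has_text"] else 0
        start = i
        i -= 1

    # API invariant: do not start on a tool_result; pull in the preceding
    # assistant message that holds the matching tool_use.
    while start > floor and messages[start].get("is_tool_result"):
        start -= 1
    return start
```

With ten 3K-token text messages and the summary having absorbed the first nine, the start moves back from index 9 to index 5, at which point both minimums (15K tokens, 5 text messages) are satisfied.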

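The full picture of post-compact attachments

Many people assume the output of compact is just the trio "boundary + summary + recent messages". In fact a series of attachments follows, and they are what lets the model pick the work back up quickly after a compact.

buildPostCompactMessages() assembles things in this order: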

CodeBlock Loading...

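The 8 attachment types each have their own trigger condition:

#	Attachment type	Injected content	When it appears
1	`file_reference`	Recently read files, with their content restored	There are recently Read files that are not in the retained region
2	`plan_file_reference`	The plan file of the current session	There is an active plan
3	`invoked_skills`	Skills activated in this session	Any skill has been activated
4	`plan_mode`	Plan mode status hint	The session is currently in plan mode
5	`task_status`	Status of agents / tasks running in the background	A background async agent is running
6	`deferred_tools_delta`	Additions and removals in the tool list since before the compact	The tool list has changed
7	`agent_listing_delta`	Additions and removals in the agent list	The agent list has changed
8	`mcp_instructions_delta`	Changes to MCP instructions	MCP instructions have changed

Budgets (default constants):

File attachments: at most 5, ≤ 50K tokens in total, ≤ 5K tokens per file
Skill attachments: ≤ 25K tokens in total, ≤ 5K tokens per skill

In other words, Claude Code budgets strictly what may go into the "inventory": the summary tells you what we are doing, and the attachments supply the raw material to keep doing it. The division of labor is clear.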
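As an illustration of how the file budget could be applied, here is a small Python sketch. The three constants come from the budgets above; everything else is an assumption, including the function name, the (path, token_count) input shape, the most-recent-first ordering, and the choice to truncate an oversized file to the per-file cap instead of skipping it.

```python
MAX_FILES = 5
MAX_TOTAL_TOKENS = 50_000
MAX_TOKENS_PER_FILE = 5_000


def select_file_attachments(recent_files: list, paths_in_kept_region: set) -> list:
    """Pick files to re-attach after a compact.

    recent_files: (path, token_count) pairs, most recently read first.
    Returns (path, granted_tokens) pairs within the budgets.
    """
    selected, total = [], 0
    for path, tokens in recent_files:
        if len(selected) == MAX_FILES:
            break
        if path in paths_in_kept_region:
            continue  # already visible in the retained messages
        granted = min(tokens, MAX_TOKENS_PER_FILE)  # assumption: truncate, not skip
        if total + granted > MAX_TOTAL_TOKENS:
            break
        selected.append((path, granted))
        total += granted
    return selected
```

Note that with these defaults five files at 5K tokens each add up to 25K, so the 50K total only starts to bind if the per-file or file-count limits are raised.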

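Full example: a two-hour session before and after compaction

Scenario: you have spent 2 hours with Claude on an auth refactor, with 40+ tool calls and a dozen or so files changed along the way. The context is close to the threshold, and the last few turns are implementing AuthSession.refresh(). summary.md has been updated in the background throughout.

Step 0: current contents of `summary.md` (on disk)

Filled in from the 10-section template, it looks roughly like this (only the first few sections are shown; the real file has all 10):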

CodeBlock Loading...

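Note that this file is a fixed template filled in by a background agent, not a free-form diary. Each section in the template carries a line of italic guidance (for example "What is actively being worked on right now?"), and the background agent fills in the section with that guidance in view.

Step 1: the messages array before compaction (simplified)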

CodeBlock Loading...

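Step 2: Autocompact triggers and takes the Session Memory Compact path

The algorithm does three things:

Pick the split point: with lastSummarizedMessageId = u128, u129 and everything after it belongs to the retained region
Adjust for API invariants: the first message of the retained region is an assistant message with a tool_use, and its matching tool_result is also in the retained region, so nothing needs to be pulled in
Build the summary message: wrap the body of summary.md in a user message

Step 3: the messages array after compaction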

CodeBlock Loading...

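A few things worth noting

The line in bold, "recent messages kept verbatim", is the most fundamental difference between SM-compact and traditional LLM compact. The model still gets the words you actually said, the tools that actually ran, and the results that actually came back, not an LLM paraphrase.
The summary comes from a file, not from an ad hoc model call. Because summary.md is maintained continuously in the background, the client does not need another API call at the moment Autocompact triggers; it reads from disk. That is why it is the first try: faster and cheaper than the fallback LLM Compact.
API invariants are not broken. Every tool_result in the retained region is guaranteed to find its tool_use. The algorithm checks this and, when necessary, pulls in the assistant message that holds the tool_use, even if it sits before lastSummarizedMessageId.
preservedSegment is the hook for relinking. The headUuid / anchorUuid / tailUuid on the compact boundary record where the preserved segment starts and ends. On resume, these three UUIDs reconnect the post-compact view to the original transcript.
Post-compact attachments are not part of the summary; they are the "inventory". The summary tells you "we are changing AuthSession", and the attachments make sure the latest source of AuthSession.ts is already in context, so no extra Read is needed. The file budget is fixed: 50K tokens in total, 5K per file, at most 5 files.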

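Sub-path B: traditional LLM Compact (9-section summary)

What it is

The old path taken when Session Memory Compact fails (most commonly because the context overflowed before summary.md reached its initialization threshold). The approach is direct: call the model once, on the spot, have it produce a 9-section structured summary of the current session, and replace the original history with that summary.

When it triggers

When trySessionMemoryCompaction() returns null.

What happens once it triggers

At its core this is a "conversation about the conversation": the client builds a new API request out of the current messages: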

CodeBlock Loading...

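The api_messages array actually sent to the summarization model looks roughly like this. The large middle part is the full original history of the current session, and square brackets mark what has been omitted: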

CodeBlock Loading...

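The call also sets three key parameters:

system is a fixed sentence: "You are a helpful AI assistant tasked with summarizing conversations."
thinkingConfig is explicitly disabled, since summarization does not need extended thinking
querySource = "compact" marks the request as a summarization call, so it does not re-enter compact / snip or any other context management flow (avoiding recursion)

The model is expected to return two plain-text parts, <analysis>...</analysis><summary>...</summary>. The client extracts the <summary> part as the body of the 9-section summary, then runs buildPostCompactMessages() to assemble the new main-thread messages (the same assembly function that Session Memory Compact uses).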
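Pulling the summary out of the two-part output can be as simple as one regular expression. The Python sketch below is illustrative only; extract_summary is a hypothetical name, and returning None when no <summary> tag is found is an assumption about how a caller might detect a malformed reply.

```python
import re
from typing import Optional

# DOTALL lets "." match newlines, since the summary spans many lines.
SUMMARY_RE = re.compile(r"<summary>(.*?)</summary>", re.DOTALL)


def extract_summary(model_output: str) -> Optional[str]:
    """Return the text inside <summary>...</summary>, or None if it is missing."""
    match = SUMMARY_RE.search(model_output)
    return match.group(1).strip() if match else None
```

The <analysis> part is simply ignored, which matches its role as scratch space that never reaches the rebuilt messages.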

The summary must contain exactly 9 sections (the full prompt is in the appendix at the end of this section):

CodeBlock Loading...

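What if compact itself hits prompt-too-long

This self-rescue mechanism is easy to overlook and very important. The "conversation about the conversation" above is itself an API call and can also come back with a PTL error, especially when compact triggers on a session that is already huge: sending "current history + a long prompt" together easily exceeds the token limit.

Compact does not let the session deadlock. It allows at most 3 PTL retries, as follows: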

CodeBlock Loading...

The corresponding pseudocode:

CodeBlock Loading...

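A few design choices worth noting:

Dropping is done per round, not per message. A round corresponds roughly to "user input → a series of assistant tool_use / thinking steps → the final text reply". Dropping whole rounds keeps tool_use and tool_result together and avoids creating new API invariant violations.
tokenGap-based dropping takes priority. When the API tells you explicitly "you are over by X tokens", dropping just enough rounds to cover X tokens suffices; only when no tokenGap is available does it fall back to dropping 20%.
The synthetic meta marker is a structural tax. The line [earlier conversation truncated for compaction retry] is not meant for the user, nor for the model to "really" read. It exists purely to satisfy the API constraint that the first message must be a user message.
After 3 retries it gives up. If three rounds of cutting still produce PTL, the session contains a huge single message (say a 100K-token tool_result that slipped past Tool Result Budget). At that point compact cannot help and simply aborts.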
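The retry behavior described in these points can be sketched as a short loop. This Python version is an approximation built from the description, not the source code: PromptTooLong, drop_oldest_rounds, compact_with_retry, the summarize(prefix, rounds) callback, and the per-round tokens field are all hypothetical.

```python
MAX_PTL_RETRIES = 3
MARKER = "[earlier conversation truncated for compaction retry]"


class PromptTooLong(Exception):
    def __init__(self, token_gap=None):
        super().__init__("prompt too long")
        self.token_gap = token_gap  # how far over the limit, if the API said so


def drop_oldest_rounds(rounds: list, token_gap) -> list:
    if token_gap is None:
        n = max(1, len(rounds) // 5)  # fallback: drop roughly 20% of the rounds
    else:
        n, freed = 0, 0
        while n < len(rounds) and freed < token_gap:
            freed += rounds[n]["tokens"]
            n += 1
    return rounds[n:]


def compact_with_retry(rounds: list, summarize):
    truncated = False
    for attempt in range(MAX_PTL_RETRIES + 1):
        # After any truncation, lead with a synthetic user message.
        prefix = [{"role": "user", "content": MARKER}] if truncated else []
        try:
            return summarize(prefix, rounds)
        except PromptTooLong as err:
            remaining = drop_oldest_rounds(rounds, err.token_gap)
            if attempt == MAX_PTL_RETRIES or not remaining:
                raise  # out of retries, or nothing left to drop: abort
            rounds, truncated = remaining, True
```

Starting from 10 rounds with no tokenGap available, successive attempts see 10, 8, 7, and 6 rounds before the loop gives up.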

Structure of the main-thread messages after compaction

Once the summarization model returns, the client rebuilds the main-thread messages with buildPostCompactMessages():

CodeBlock Loading...

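There is only one fundamental difference from Session Memory Compact: no messagesToKeep segment. Traditional LLM Compact replaces all history with the summary, and the recent messages are not kept verbatim. Everything else (the order and budgets of boundary / summary / attachments / hooks) is identical, because the same assembly function is used.

The post-compact restore budgets are hard-coded constants (defaults):

At most 5 files restored
50K tokens across all attachments
5K tokens per file
5K tokens per skill
25K tokens across all skills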

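Scenario

You open a new session and discuss a very dense problem with Claude: within a few minutes it runs a dozen or so large tool calls and the context shoots straight to the limit. At this point summary.md has not been initialized yet (the 10K-token threshold assumes a session that grows steadily, whereas this one is a dense burst in a short time).

Session Memory Compact returns null and the flow falls back to traditional LLM Compact:

The main thread is suspended first
The model is called once, on the spot, and produces a 9-section structured summary
The main-thread messages are rebuilt from the summary + the few key files read most recently + the current plan
The session continues, but the recent messages are not kept verbatim, which is the biggest difference from Session Memory Compact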

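Appendix: full text of the compact prompt

For easy cross-checking, below is the full user message that getCompactPrompt() assembles and sends to the summarization model. It is concatenated from 4 parts: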

CodeBlock Loading...

Full text (when customInstructions is not configured):

CodeBlock Loading...

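Some prompt engineering details in this prompt worth a look:

Tool calls are forbidden three times: CRITICAL at the start, "Tool calls will be REJECTED" in the middle, and a REMINDER at the end. For a model with strong tool-calling ability, this kind of repeated hard constraint is necessary; say it only once and the model will still be tempted to Read something to double-check
A forced two-part output, <analysis> → <summary>: the former is the model "thinking it through first", the latter is the summary that actually gets written into the transcript. Separating them keeps the thinking process from polluting the summary body
Section 6, "All user messages", is a defense against distortion: it forces a listing of every user message that is not a tool_result, so the model cannot keep only what it prefers to remember
Section 9 requires "direct quotes": the next step must include verbatim excerpts, which prevents "task drift" after summarization. The worst thing compact can do is let the model quietly change the user's intent while summarizing
customInstructions is a trailing slot: users can append to this prompt through CLAUDE.md or dedicated compact instructions, for example "focus on typescript code changes" / "include test output verbatim"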

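What the layered design teaches

Putting the five stages plus the two sub-paths into one table:

Stage	When it triggers	Acts on	Cost	Calls the model?
Tool Result Budget	On tool return & before each request	A single tool_result	Very low	No
Snip	Every request / model-initiated	A whole user turn	Low	No (model-driven)
Microcompact (time)	After 60 min of silence	Old tool_result.content	Low	No
Microcompact (cached)	Every request (with cache support)	Server-side cache view	Very low	No
Context Collapse	Every request	Segmented archive + summary placeholder	Medium	Yes (summary generation)
Session Memory Compact	Autocompact, preferred	Early history → summary.md	Medium (disk read)	Background agent maintains the file
Traditional LLM Compact	Autocompact, fallback	Full history → 9-section summary	High (main-thread LLM call)	Yes

A few design choices worth noting:

"Do the cheap things first". Tool Result Budget is just a character-count comparison plus a file write, nearly free; LLM Compact is a main-thread API call, which is heavy. The pipeline puts cheap, fine-grained processing up front and expensive, coarse processing at the back: a typical cost-aware pipeline.
"Don't call the model unless you have to". Leaving aside Context Collapse's summary generation and the background agent behind summary.md, only the fallback path of the last stage, Autocompact, spends an API call on summarization at request time. The other earlier stages are mechanical replacements, UUID deletions, or server-side cache directives.
"Keep the recent original text" is a clear value ordering. All the complexity of Session Memory Compact (the continuously maintained summary.md, the lastSummarizedMessageId bookkeeping, the API invariant repairs) protects a single goal: keeping recent messages verbatim. What the developer is working on right now tends to need the original details, while the "background context" only needs a knowledge-level summary.
Every stage is a modification of the messages array. For the Anthropic API there is no magic compression parameter; every mechanism lands on the array in the payload. The only exceptions are the cache_edits of Cached Microcompact and API-native Context Management (clear_thinking_* / clear_tool_uses_*), which are server-level contracts.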

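Closing thoughts

If you are building context management for an AI application, this pipeline offers at least three ideas you can borrow directly:

Layers, not a single point. Do not rely on one big "compress at the threshold" hammer. Growth of different sizes and kinds calls for handling at different costs.
Keeping the original text beats summarizing. A model almost always does better reading the original than reading its own summary, so keep it whenever you can.
The compaction mechanism needs its own self-rescue. What happens when the API request made by your compact logic is itself prompt-too-long? Claude Code has 3 PTL retries plus a synthetic marker for exactly this, which is worth borrowing.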

开场：我们到底在聊什么

这篇就是把这条流水线彻底拆开，告诉你每一级：

是什么
什么时候触发
触发后发生什么
messages 数组前后怎么变
一个实际场景

先对齐一个共识：messages 数组才是唯一真相

对 Anthropic API 来说，所有"上下文管理"归根到底就是客户端决定一件事——

这一轮我要把哪一组 messages 发过去。

记住这一点：下面讲的每一级，都是在重构 messages 数组。

总览：五级流水线

每次要向 API 发请求之前，Claude Code 按这个顺序跑：

TEXT

Tool Result Budget → Snip → Microcompact → Context Collapse → Autocompact
                                                                  │
                                            ┌─────────────────────┤
                                            ▼                     ▼
                                 Session Memory Compact    传统 LLM Compact
                                       (首选路径)              (兜底路径)

CodeBlock Loading...

设计直觉是：

最前面的机制最便宜、最精细，只在入口处拦截或替换小块内容
越往后代价越高、作用越粗，最后一步才会真的去调一次模型来做摘要
每一级都在前一级没解决问题时才接管，避免"上来就做重活"

下面挨个拆开。

第一级：Tool Result Budget（工具结果预算）

是什么

这是入口限流，不是对历史的压缩。

当一个工具（比如 Bash、Read、Grep）执行完，返回的 tool_result 要塞进 messages 数组之前，先过一道预算检查：太大就不让原文进去。

什么时候触发

两个时间点：

单结果时刻：每个工具刚执行完，结果准备包装成 tool_result block 时
聚合时刻：每次向 API 发请求之前，对整条 user message 里的所有 tool_result 做一次总预算检查（主要防并行工具同轮返回的叠加）

触发后发生什么

两级阈值（默认值）：

单结果 50K 字符：超了就持久化到磁盘的 tool-results/<tool_use_id>.txt，messages 里只留一个"引用消息"
单 message 里所有 tool_result 合计 200K 字符：超了就挑最大的那个做同样替换，直到回到预算内

替换后的内容是固定模板：

TEXT

<persisted-output>
Output too large (317842 chars). Full output saved to: /project/.claude/tool-results/toolu_abc.txt

Preview (first 2 KB):
...前 2KB 原文...
</persisted-output>

CodeBlock Loading...

messages 数组前后对比

JSONC

// 变换前 —— 原始 tool_result
{
  "type": "user",
  "message": {
    "content": [
      {
        "type": "tool_result",
        "tool_use_id": "toolu_abc",
        "content": "<300KB 的 bash stdout>"
      }
    ]
  }
}

// 变换后 —— preview + 磁盘路径
{
  "type": "user",
  "message": {
    "content": [
      {
        "type": "tool_result",
        "tool_use_id": "toolu_abc",
        "content": "<persisted-output>Output too large (317842 chars). Full output saved to: /project/.claude/tool-results/toolu_abc.txt\n\nPreview (first 2 KB):\n...\n</persisted-output>"
      }
    ]
  }
}

CodeBlock Loading...

场景

你让 Claude 跑 cat huge.log，stdout 500KB。如果这 500KB 原文直接进上下文：

本轮 API 直接吃掉大半个窗口
后续每一轮还会重复发一次（没有任何 client cache 能解决）
模型其实也读不动这么长的 log

第二级：Snip（精准裁剪）

是什么

这是整个流水线里唯一由模型主导的一级——其他几级都是 client 自动做决定。

删除单位：一整个 user turn

理解 Snip 最重要的一点是它按 user turn 为单位工作，不是按单条 message。

一个 user turn 长这样：

TEXT

[ user input ][ assistant/tool_use ][ tool_result ][ assistant/tool_use ][ tool_result ]...
 ↑ 这里挂一个 [id:xxxxxx]                                                                  ↑ 直到下一条真正的 user input

CodeBlock Loading...

当模型对某个 ID 调用 SnipTool，整个从该 user input 开始、到下一条 user input 之前的所有消息都从后续 API 请求里消失。

什么时候触发

三个时间点：

每次向 API 发请求前：在发送的那一份 messages 副本上，给所有非 meta 的 user message 追加 [id:<短ID>] 尾标（tool_result 类的 user message 不算"真正的 user input"，不会挂 ID）
上下文每增长约 10K tokens 却没 snip 过：注入一条 nudge attachment，提醒模型"可以 snip 一下"
模型主动调用 SnipTool：真正执行删除

触发后发生什么

短 ID 由 message UUID 派生（hex 前 10 位转 base36 前 6 位），每条 user input 都有稳定、短小、模型容易复述的 ID
模型调用 SnipTool 传入一个或多个 ID，client 把对应 user turn 的所有 UUID 收集起来，从 in-memory 数组里删掉
removedUuids 写到 transcript 边界；resume 时重放保证持久化
删除后对受影响的 parentUuid 做回溯修复，避免 dangling 链

关键细节：[id:xxxxxx] 这个 tag 只加在"发给 API 的 copy"上，不会写回原始存储。transcript 里用户的原话永远干净，只有 model-visible 的那份带 tag。

完整示例：从多轮对话到一次 Snip

第 0 步：原始 messages 数组（发给 API 的版本）

JSONC

[
  // —— Turn 1：用户要求调研 TODO ——
  {
    "type": "user",
    "message": {
      "content": "帮我找一下项目里所有的 TODO，我想集中处理一下\n[id:abc123]"
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "好的，我用 Grep 搜索一下。" },
        {
          "type": "tool_use",
          "id": "toolu_01",
          "name": "Grep",
          "input": { "pattern": "TODO", "path": "src/" }
        }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_01",
          "content": "src/auth/login.ts:42: // TODO: handle refresh token\nsrc/api/user.ts:89: // TODO: validate input\n... 共 23 处 ..."
        }
      ]
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "我打开 login.ts 看一下那个 refresh token 的 TODO 具体场景。" },
        {
          "type": "tool_use",
          "id": "toolu_02",
          "name": "Read",
          "input": { "file_path": "src/auth/login.ts" }
        }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_02",
          "content": "<login.ts 全文，约 4KB>"
        }
      ]
    }
  },

  // —— Turn 2：用户改主意，转去修 bug ——
  {
    "type": "user",
    "message": {
      "content": "算了先别搞 TODO 了，线上登录报 500 的 bug 更紧急，先修这个\n[id:def456]"
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "理解，TODO 先放下。登录 500 我查一下最近的错误日志。" },
        {
          "type": "tool_use",
          "id": "toolu_03",
          "name": "Bash",
          "input": { "command": "tail -200 /var/log/app/error.log" }
        }
      ]
    }
  }
]

CodeBlock Loading...

第 1 步：模型主动调用 SnipTool

JSONC

{
  "type": "tool_use",
  "id": "toolu_04",
  "name": "Snip",
  "input": {
    "ids": ["abc123"],
    "reason": "用户已改方向放弃 TODO 调研，原轮次的 Grep 结果与 login.ts 全文与当前 bug 修复无关"
  }
}

CodeBlock Loading...

第 2 步：Snip 执行后，下一轮发给 API 的 messages 数组

Turn 1 整段（含 user input + 两次 assistant tooluse + 两次 toolresult，共 5 条 message）被物理删除：

JSONC

[
  // Turn 1 整段消失，直接从 Turn 2 的 user input 开始

  {
    "type": "user",
    "message": {
      "content": "算了先别搞 TODO 了，线上登录报 500 的 bug 更紧急，先修这个\n[id:def456]"
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "理解，TODO 先放下。登录 500 我查一下最近的错误日志。" },
        {
          "type": "tool_use",
          "id": "toolu_03",
          "name": "Bash",
          "input": { "command": "tail -200 /var/log/app/error.log" }
        }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_03",
          "content": "<日志 tail 输出>"
        }
      ]
    }
  }
]

CodeBlock Loading...

值得注意的几件事

删除是成对的。toolu_01 的 tool_use 和它对应的 tool_result 一起消失，toolu_02 同理。这样 API 侧不会出现"有 toolresult 但找不到 tooluse"或反过来的错误。
[id:def456] 没被动过。Snip 精准作用在 abc123 所在的那个 user turn，不会误伤后续 turn。
transcript 落盘版不变。如果你 resume 这个会话，Claude Code 会根据 removedUuids 重放同样的删除结果，让 model-visible 视图保持一致——但用户"原话"本身始终保留在磁盘上，随时可审计。
上下文节省明显。Turn 1 的 Grep 结果 + login.ts 全文大概 6~8K tokens，一次 Snip 直接省掉，而且这是精准保留原文的做法，不是总结。

为什么这么设计

当你只是 pivot 一次方向、想丢掉某段岔路，用 Snip
当整个上下文已经超载、没有明显的"某轮已作废"边界时，用后面的 Microcompact / Autocompact

第三级：Microcompact（轻量重写）

是什么

只处理这几个工具的结果：Read、Bash、Grep、Glob、WebSearch、WebFetch、Edit、Write。用户的文字、模型的思考、plan、attachment 它一概不动。

什么时候触发

两条独立路径：

路径 A：Time-based Microcompact

默认关闭，开启后：
距离上一条 assistant message 超过 60 分钟 +
在主线程 +
每次发请求前检查一次

路径 B：Cached Microcompact

Feature flag 开启 +
模型支持 cache editing +
在主线程 +
每次发请求前检查一次

触发后发生什么

Time-based 路径——直接改本地 messages：

按 tool id 找出所有"可压缩工具"的 tool_result
保留最近 5 个，其余的 content 原文替换成字面字符串 [Old tool result content cleared]
顺带 reset cached microcompact 的模块状态（避免 cache 引用失效的 tool id）

Cached 路径——本地 messages 不变，而是在 API 层带 cache_edits：

本地数组里那些旧 tool_result 看上去毫发无损
但向 Anthropic 发请求时，payload 多了一段 cache_edits 指令，告诉服务端"你缓存里编号 xxx 的那几段我不要了"
好处是 prompt cache prefix 尽量保住，避免 time-based 那种"一动就全 miss"

另外还有一层API-native Context Management，不是客户端做的，而是 Anthropic API 原生支持的策略：

JSONC

{ "type": "clear_thinking_20251015", "keep": "all" }
{
  "type": "clear_tool_uses_20250919",
  "trigger": { "type": "input_tokens", "value": 180000 },
  "clear_at_least": { "type": "input_tokens", "value": 140000 }
}

CodeBlock Loading...

这两个块加在 API 参数里，由服务端在超过 180K input tokens 时自动清理 tool_use 类内容。

完整示例：Time-based 路径

第 0 步：离开前的 messages 数组

JSONC

[
  {
    "type": "user",
    "message": { "content": "帮我了解一下这个项目的结构\n[id:abc123]" }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "我先看看目录。" },
        { "type": "tool_use", "id": "toolu_01", "name": "Glob", "input": { "pattern": "**/*.ts" } }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_01",
          "content": "src/index.ts\nsrc/auth/login.ts\nsrc/api/user.ts\n... 共 147 个文件 ..."
        }
      ]
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "读一下入口文件。" },
        { "type": "tool_use", "id": "toolu_02", "name": "Read", "input": { "file_path": "src/index.ts" } }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_02",
          "content": "import { createApp } from './app'\nimport { loadConfig } from './config'\n... <完整 4KB 源码> ..."
        }
      ]
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "再扫一下 config 相关的定义。" },
        { "type": "tool_use", "id": "toolu_03", "name": "Grep", "input": { "pattern": "loadConfig", "path": "src/" } }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_03",
          "content": "src/config.ts:12: export function loadConfig() {\nsrc/config.ts:34:   return loadConfig()\n... 共 8 处匹配 ..."
        }
      ]
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "项目结构大致是单入口 + 按模块分的 config/auth/api 三块。你想先深入哪一块？" }
      ]
    }
    // ← 最后一条 assistant 的 timestamp 记作 12:00
  }
]

CodeBlock Loading...

第 1 步：用户 70 分钟后回来

JSONC

{
  "type": "user",
  "message": { "content": "那就从 auth 模块开始\n[id:def456]" }
  // ← 当前时间 13:10，距离上一条 assistant (12:00) 已 70 分钟
}

CodeBlock Loading...

这一刻，Time-based Microcompact 的触发条件成立：主线程 + 有上一条 assistant + gap > 60 分钟。

第 2 步：Microcompact 扫一遍 messages 数组

从旧到新找出所有"可压缩工具"的 tool_result：

TEXT

toolu_01 (Glob)   toolu_02 (Read)   toolu_03 (Grep)
   ↑ 最旧              ↑ 第二              ↑ 最新

CodeBlock Loading...

保留最近 keepRecent = 2 个（toolu_02 / toolu_03），其余的 content 替换为占位符。

第 3 步：变换后的 messages 数组

JSONC

[
  // user_1 + 对应 assistant.tool_use 完全不动
  {
    "type": "user",
    "message": { "content": "帮我了解一下这个项目的结构\n[id:abc123]" }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "我先看看目录。" },
        { "type": "tool_use", "id": "toolu_01", "name": "Glob", "input": { "pattern": "**/*.ts" } }
      ]
    }
  },

  // ★ toolu_01 的 tool_result.content 被替换
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_01",
          "content": "[Old tool result content cleared]"
        }
      ]
    }
  },

  // toolu_02 / toolu_03 原文保留（落在 keepRecent 窗口内）
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "读一下入口文件。" },
        { "type": "tool_use", "id": "toolu_02", "name": "Read", "input": { "file_path": "src/index.ts" } }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_02",
          "content": "import { createApp } from './app'\nimport { loadConfig } from './config'\n... <完整 4KB 源码> ..."
        }
      ]
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "再扫一下 config 相关的定义。" },
        { "type": "tool_use", "id": "toolu_03", "name": "Grep", "input": { "pattern": "loadConfig", "path": "src/" } }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_03",
          "content": "src/config.ts:12: export function loadConfig() {\nsrc/config.ts:34:   return loadConfig()\n... 共 8 处匹配 ..."
        }
      ]
    }
  },

  // assistant 收尾 + 新 user input 原样附在后面
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "项目结构大致是单入口 + 按模块分的 config/auth/api 三块。你想先深入哪一块？" }
      ]
    }
  },
  {
    "type": "user",
    "message": { "content": "那就从 auth 模块开始\n[id:def456]" }
  }
]

CodeBlock Loading...

值得注意的几件事

tool_use_id 不删——toolu_01 的 tooluse（含参数 pattern: "**/*.ts"）完整保留，只是对应的 toolresult content 被替换成字面字符串。API 侧的 tool_use ↔ tool_result 配对关系依然成立。
模型仍能判断"跑过哪些工具"。它能看到 toolu_01 是一次 Glob("**/*.ts") 调用，只是具体返回已作废。如果后续真的需要，它可以再调一次 Glob 重新取。
用户文本、assistant 的思考一概不动——Microcompact 只针对 tool_result 的 content。
体积收益有多大。这个例子里 toolu01 的 Glob 输出大概 3KB，一个占位符 31 字节。真实会话里你很可能有 10+ 个旧 toolresult，每个几 KB 到几十 KB，省下来是实打实的 5K~50K tokens。

Cached 路径的差异

Cached 路径的关键是本地 messages 完全不变，替换发生在服务端缓存那边。对比如下（沿用上面的场景但走 Cached 路径）：

JSONC

// 本地 messages 数组：原文全部保留，没有任何 [Old ... cleared]

// 发给 Anthropic 的 API request body（概念化）：
{
  "model": "claude-xxx",
  "messages": [ /* 上面那个完整数组，原文不变 */ ],
  "cache_edits": [
    // 客户端维护的 pending edits，告诉服务端
    // "你 prompt cache 里对应 toolu_01 那段可以删掉了"
    { "action": "clear", "tool_use_id": "toolu_01" }
  ]
}

CodeBlock Loading...

本质上就是缓存的冷热区分

退一步看，两条路径的分工其实就是一个 cache state machine：

热缓存（cache 仍有价值）→ 走 Cached 路径，API 层 cache_edits 精细化编辑
冷缓存（60 分钟无活动，cache 多半已 expire）→ 走 Time-based 路径，放弃 cache 直接裁切本地 messages

场景小结

活跃使用中：Cached 路径在后台静默裁剪服务端 cache，本地体感 0 变化，但请求更便宜
长时间暂停后回来：Time-based 路径直接本地清老 tool_result，舍弃 cache 命中换上下文空间
接近 token 上限时：API-native 的 clear_tool_uses_20250919 兜底，由服务端在 180K 阈值上自动清理

第四级：Context Collapse

从 transcript 里能看到 Collapse 会落两种记录：

marble-origami-commit：append-only 的 splice 指令，记录"如何把某段历史折叠成一条 summary placeholder"，包含 collapseId / summaryUuid / summaryContent / firstArchivedUuid / lastArchivedUuid
marble-origami-snapshot：last-wins 的 staged 状态快照，包含 staged spans / armed 标志 / lastSpawnTokens

第五级：Autocompact（重度兜底）

是什么

当前面四级都没能把上下文压下来时的最后一道防线。它本身不做压缩，而是两条子路径二选一：

首选：Session Memory Compact（读一份持续维护的 summary.md）
兜底：传统 LLM Compact（临时调一次模型做 9 段式摘要）

什么时候触发

shouldAutoCompact() 判断"上下文接近 token 上限"时。注意如果 Context Collapse 启用了，这一步会被直接跳过（让 Collapse 处理）。

触发后的流程是：

TEXT

autoCompactIfNeeded()
  ├── 先试 Session Memory Compact
  │     ├── 成功 → 返回结果
  │     └── 失败（null）→ 回退
  └── 传统 LLM Compact

CodeBlock Loading...

子路径 A：Session Memory Compact（结构化摘要）

是什么

核心思路是：不要等到上下文爆了才开始总结，平时就在后台持续维护一份结构化摘要文件，爆的时候直接读这份文件当摘要用。

这样的好处非常明确：

触发压缩时 zero API cost——不需要临时调模型，直接读磁盘
摘要质量更稳定——后台每隔一段时间更新一次，覆盖面比"临时生成一份"要系统
可以持续沉淀跨轮信息——做过的错误修正、对项目的认知，都能累积在一份文件里

Session Memory 文件的存储位置

文件名 summary.md，完整路径：

TEXT

{projectDir}/{sessionId}/session-memory/summary.md

CodeBlock Loading...

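Note that there is one file per session, not one shared across the project. The reason is simple: different sessions do different things, and sharing a file would let them interfere with each other.

The 10-section template

#	Section	What goes in this section (guidance)
1	Session Title	An information-dense session title of 5-10 words, no filler
2	Current State	What is being worked on right now? Unfinished tasks and the next step
3	Task specification	What does the user want built? Any design decisions or explanatory context
4	Files and Functions	Which files matter? What each contains and why it is relevant
5	Workflow	Which bash commands are commonly run, in what order, and how to read their output
6	Errors & Corrections	Errors encountered, how they were fixed, what the user corrected, which approaches failed
7	Codebase and System Documentation	Important system components and how they work together
8	Learnings	What worked, what did not, what to avoid (without repeating other sections)
9	Key results	If the user explicitly asked for a result (an answer, table, or document), keep it here verbatim
10	Worklog	What was attempted step by step, with a terse summary of each step

Keep this separate from the 9-section summary of traditional LLM Compact described later. That one is the format the Autocompact fallback path asks the model to generate on the spot, and its sections differ (for example "All user messages", "Current Work", and "Optional Next Step", which lean toward conversation context). The 10-section Session Memory template leans toward project memory.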

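When it triggers

There are two layers: when the background updates summary.md, and when Autocompact reads it.

Background update triggers (default thresholds):

First initialization: the current messages reach 10,000 tokens
Incremental update: token growth ≥ 5000 && (tool calls ≥ 3 || the latest turn has no tool calls)
Runs only when querySource === 'repl_main_thread'; subagents and teammates do not run it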
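
The update conditions above can be written as a single predicate. This is a sketch using the default thresholds quoted in the list; the function and parameter names are hypothetical.

```python
INIT_TOKENS = 10_000        # first initialization threshold
MIN_TOKEN_GROWTH = 5_000    # growth required for an incremental update
MIN_TOOL_CALLS = 3          # tool calls required alongside the growth


def should_update_summary(query_source: str,
                          initialized: bool,
                          current_tokens: int,
                          tokens_at_last_update: int,
                          tool_calls_since_update: int,
                          last_turn_has_tool_calls: bool) -> bool:
    """Decide whether the background extraction should run (illustrative)."""
    if query_source != "repl_main_thread":
        return False  # subagents and teammates never update the file
    if not initialized:
        return current_tokens >= INIT_TOKENS
    grown = current_tokens - tokens_at_last_update >= MIN_TOKEN_GROWTH
    enough_tools = tool_calls_since_update >= MIN_TOOL_CALLS
    return grown and (enough_tools or not last_turn_has_tool_calls)


print(should_update_summary("repl_main_thread", True, 22_000, 16_000, 1, False))
```

Token growth is required in both branches of the original condition, so it factors out: the tool-call count and the "quiet last turn" check only decide whether this is a good moment to extract.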

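When Autocompact calls it (the first-try entry for sub-path A):

shouldAutoCompact() decides compaction is needed
Wait for any in-flight background extraction to finish
Read summary.md; if the file does not exist or is still the empty template, return null and hand over to the fallback
Otherwise, perform the compaction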
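
The read step can be sketched as follows. The path layout follows the one given earlier; how the real client detects "still the empty template" is not described, so the heading-only heuristic here is an assumption, and waiting for an in-flight extraction is left out. All function names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional


def summary_path(project_dir: Path, session_id: str) -> Path:
    return project_dir / session_id / "session-memory" / "summary.md"


def is_empty_template(text: str) -> bool:
    # Assumption: a file counts as the empty template when every
    # non-blank line is a section heading (or the file is blank).
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return all(ln.startswith("#") for ln in lines)


def read_summary_for_compact(project_dir: Path, session_id: str) -> Optional[str]:
    """Return the summary text, or None to hand over to the fallback path."""
    path = summary_path(project_dir, session_id)
    if not path.exists():
        return None
    text = path.read_text(encoding="utf-8")
    if is_empty_template(text):
        return None
    return text


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    print(read_summary_for_compact(project, "sess-1"))  # None: no file yet
    target = summary_path(project, "sess-1")
    target.parent.mkdir(parents=True)
    target.write_text("# Session Title\nRefactor auth middleware\n", encoding="utf-8")
    print(read_summary_for_compact(project, "sess-1") is not None)  # True
```

Returning `None` rather than raising keeps the caller's fallback logic a plain conditional, which matches the "return null and hand over" wording above.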

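The lifecycle of `lastSummarizedMessageId`

This is one of the core pieces of state in Session Memory Compact. It decides which message marks the start of the retained region, and the retention algorithm below makes no sense without it.

Meaning: the uuid of the last message absorbed into summary.md

In other words, every message up to and including the one whose uuid is lastSummarizedMessageId has already been digested by Session Memory. "Up to" means position in the messages array, since uuids themselves carry no order. Only the messages after that position are the increment for the next extraction.

When it is updated, and to what

After a background extraction finishes, the update is not unconditional. There is a safety gate: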

PYTHON

# Pseudocode
def update_last_summarized_message_id_if_safe(messages):
    # Is the last assistant message still waiting for a tool_result?
    if has_tool_calls_in_last_assistant_turn(messages):
        return  # skip the update so an unclosed tool pair is not split in the middle

    last = messages[-1]
    if last.uuid:
        set_last_summarized_message_id(last.uuid)


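Why the gate? If the last assistant message still has tool calls waiting for their tool_result and the id advanced to it anyway, the line between summarized history and the retained region would fall between a tool_use and its tool_result, splitting the pair.

The retention window algorithm in detail

What calculateMessagesToKeepIndex() does in the source, written as pseudocode: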

PYTHON

# Default configuration
MIN_TOKENS = 10_000
MIN_TEXT_BLOCK_MESSAGES = 5
MAX_TOKENS = 40_000

def calculate_messages_to_keep_index(messages, last_summarized_message_id, floor=0):
    # floor: index of the previous compact boundary (0 if there is none).
    # Passing it in as a parameter is a choice made for this pseudocode.

    # 1) Find the dividing point
    last_index = find_index(messages, by_uuid=last_summarized_message_id)
    start = last_index + 1  # default: keep everything after the last digested message

    # 2) Count tokens / text-block messages in the current retained region
    total_tokens = sum_tokens(messages[start:])
    text_block_count = count_text_block_messages(messages[start:])

    # 3) Both lower bounds already met → done
    if total_tokens >= MIN_TOKENS and text_block_count >= MIN_TEXT_BLOCK_MESSAGES:
        return adjust_index_to_preserve_api_invariants(messages, start)

    # 4) Otherwise move the start backward from start-1 (take earlier messages)
    #    until (total_tokens >= MIN_TOKENS AND text_block_count >= MIN) both hold,
    #    or total_tokens >= MAX_TOKENS hits the hard cap first,
    #    or the floor is reached (the previous compact boundary, never crossed)
    for i in range(start - 1, floor - 1, -1):
        total_tokens += token_count(messages[i])
        if has_text_blocks(messages[i]):
            text_block_count += 1
        start = i

        if total_tokens >= MAX_TOKENS:
            break  # the hard cap takes priority

        if total_tokens >= MIN_TOKENS and text_block_count >= MIN_TEXT_BLOCK_MESSAGES:
            break  # both lower bounds are met

    # 5) Finally, align with API constraints
    return adjust_index_to_preserve_api_invariants(messages, start)


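A few key points:

The lower bound is AND, not OR. The region must have enough tokens and enough text-block messages. This guards against a degenerate case: 50K of pure tool_result (a pile of large file Reads) satisfies the token bound, but with only 2 messages containing actual text the model sees almost none of the conversation's continuity. The text-block bound guarantees that at least 5 messages where someone is actually talking are kept.
The upper bound is a hard stop. Once totalTokens reaches 40K, the loop breaks and stops expanding backward. It is a capacity cap for the retained region, not a minimum guarantee.
The scan moves the start point backward. The end (the tail of the messages array) never moves; only the start does. Each i-- pulls one earlier message into the retained region.
It never crosses the compact boundary floor. If a compact already happened, backward expansion stops at the boundary of that previous compact.
Finally, API invariants are aligned: if the start lands on a tool_result whose paired tool_use sits in an earlier assistant message, that assistant message is pulled in too. Merge requirements for thinking blocks are handled the same way.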
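
The tool-pair half of that alignment step can be sketched in runnable form. This is a simplified illustration, not the real adjustIndexToPreserveAPIInvariants: it only repairs tool_use / tool_result pairs, ignores thinking-block merging, and uses flattened message dicts with a `content` field directly on the message. The helper names are hypothetical.

```python
def _blocks(msg):
    content = msg["content"]
    return content if isinstance(content, list) else []


def _ids(msg, block_type, key):
    return {b[key] for b in _blocks(msg) if b.get("type") == block_type}


def adjust_start_for_tool_pairs(messages, start):
    """Move start backward until every kept tool_result has its tool_use."""
    while start > 0:
        kept = messages[start:]
        results = set().union(*(_ids(m, "tool_result", "tool_use_id") for m in kept))
        uses = set().union(*(_ids(m, "tool_use", "id") for m in kept))
        missing = results - uses
        if not missing:
            break
        # Walk back to the nearest earlier message that supplies a missing tool_use.
        j = start - 1
        while j >= 0 and not (_ids(messages[j], "tool_use", "id") & missing):
            j -= 1
        if j < 0:
            break  # orphaned tool_result: nothing earlier can fix it
        start = j
    return start


messages = [
    {"type": "user", "content": "refactor auth"},
    {"type": "assistant", "content": [
        {"type": "tool_use", "id": "toolu_80", "name": "Read", "input": {}}]},
    {"type": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_80", "content": "..."}]},
    {"type": "assistant", "content": [{"type": "text", "text": "done"}]},
]
print(adjust_start_for_tool_pairs(messages, 2))  # 1
```

Starting at index 2 would open the retained region with a tool_result whose tool_use is missing, so the start moves back to index 1. The check repeats after each move because the newly included messages may carry tool_results of their own.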

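The full picture of post-compact attachments

Many people assume the output of a compact is just three pieces: boundary + summary + recent messages. In fact a string of attachments follows, and they are what lets the model pick the work back up quickly after a compact.

buildPostCompactMessages() assembles things in this order:

TEXT

[boundary marker]
  → [summary messages]
  → [messages to keep (the retained region, unchanged)]
  → [attachments ← a batch of meta messages, injected by the 8 types below]
  → [hook results (session start hooks)]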


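Each of the 8 attachment types has its own trigger condition:

#	Attachment type	What is injected	When it appears
1	`file_reference`	Recently read files, with their content re-attached	There are recently Read files that are not in the retained region
2	`plan_file_reference`	The current session's plan file	There is an active plan
3	`invoked_skills`	Skills activated in this session	Any skill was activated
4	`plan_mode`	A plan mode status notice	The session is currently in plan mode
5	`task_status`	Status of agents / tasks running in the background	A background async agent is running
6	`deferred_tools_delta`	Tools added or removed since before the compact	The tool list changed
7	`agent_listing_delta`	Agents added or removed	The agent list changed
8	`mcp_instructions_delta`	Changes to MCP instructions	The MCP instructions changed

Budgets (default constants):

File attachments: at most 5 files, ≤ 50K tokens in total, ≤ 5K tokens per file
Skill attachments: ≤ 25K tokens in total, ≤ 5K tokens per skill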
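
The file budget can be sketched as a greedy selection. The three constants come from the list above; whether an oversized file is truncated or skipped is not stated, so truncation to the per-file cap is an assumption here, and `select_file_attachments` is a hypothetical name.

```python
MAX_FILES = 5
MAX_TOTAL_TOKENS = 50_000
MAX_FILE_TOKENS = 5_000


def select_file_attachments(recent_files):
    """recent_files: list of (path, token_count), most recently read first."""
    chosen = []
    total = 0
    for path, tokens in recent_files:
        if len(chosen) >= MAX_FILES:
            break
        # Assumption: an oversized file is truncated to the cap, not skipped.
        capped = min(tokens, MAX_FILE_TOKENS)
        if total + capped > MAX_TOTAL_TOKENS:
            continue
        chosen.append((path, capped))
        total += capped
    return chosen


recent = [(f"src/file{i}.ts", 8_000) for i in range(7)]
picked = select_file_attachments(recent)
print(len(picked), sum(t for _, t in picked))  # 5 25000
```

With these defaults, five files at 5K each add up to at most 25K tokens, so the 50K total only becomes the binding limit if the constants change.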

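Full example: a two-hour session before and after compaction

Step 0: current contents of `summary.md` (on disk)

Filled in from the 10-section template, it looks roughly like this (a sample fill of all ten sections):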

# Session Title
Refactor auth middleware for compliance rewrite

# Current State
Migrating session token storage from cookies to encrypted Redis.
Pending: integration tests for multi-device login path.
Immediate next: finish `AuthSession.refresh()` branch.

# Task specification
- Remove raw session tokens from client cookies (legal/compliance request)
- Introduce AuthSession wrapper that holds an opaque id, with real data in Redis
- Preserve existing /login and /logout API shape; only storage layer changes
- Ensure refresh flow works across multi-device, no forced logout on other devices

# Files and Functions
- src/auth/AuthSession.ts (new): wraps opaque id + Redis-backed metadata
- src/auth/login.ts (updated): issues AuthSession on credential success
- src/auth/middleware.ts (updated): validates AuthSession on every request
- src/redis/sessionStore.ts (new): typed Redis gateway for session records

# Workflow
- `pnpm test auth/` to run auth unit tests
- `pnpm run dev:redis` to boot a local Redis for integration runs
- Error "ECONNREFUSED 6379" means Redis isn't up; start it before tests

# Errors & Corrections
- Early attempt used HMAC-signed cookies — rejected by legal (still stores session data client-side)
- First Redis schema used JSON strings — switched to hashes for partial-field updates
- User corrected: refresh must NOT invalidate other devices' session ids

# Codebase and System Documentation
- Auth path: login.ts → middleware.ts → AuthSession.ts → Redis
- Session ids are opaque; all real data lives in Redis under `session:{id}`
- TTL is sliding: every successful request extends expiry by 30 days

# Learnings
- Opaque id format needs to be URL-safe (base58 chosen over base64)
- Fail-closed fallback is acceptable because re-login UX is already smooth

# Key results
- (none yet — refresh implementation in progress)

# Worklog
- Drafted AuthSession type and Redis gateway
- Updated login/logout to issue/revoke AuthSession
- Updated middleware to validate AuthSession on each request
- Started refresh() branch; paused to handle multi-device concern


Step 1: the messages array before compaction (simplified)

JSONC

[
  // ========== 100+ early history messages (about 80K tokens in total) ==========

  // UUID: u001 (the first user input of the session)
  {
    "type": "user",
    "message": { "content": "我们得重构 auth 中间件，legal 那边给了新要求\n[id:a1b2c3]" }
  },
  { "type": "assistant", "message": { "content": [/* ... long discussion ... */] } },
  { "type": "user", "message": { "content": [/* tool_result: reading middleware.ts */] } },
  // ... 100+ interleaved user/assistant/tool_result messages omitted here ...

  // UUID: u128 (this is lastSummarizedMessageId; summary.md has digested everything up to here)

  // ========== the most recent messages (about 12K tokens, inside the 10K ~ 40K retention window) ==========

  // UUID: u129
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "AuthSession.refresh 的主要分支我画出来了，先写核心路径。" },
        { "type": "tool_use", "id": "toolu_80", "name": "Read", "input": { "file_path": "src/auth/AuthSession.ts" } }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        { "type": "tool_result", "tool_use_id": "toolu_80", "content": "<AuthSession.ts 当前内容>" }
      ]
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "refresh 要更新 Redis TTL 并回写新 opaque id。写一版。" },
        { "type": "tool_use", "id": "toolu_81", "name": "Edit", "input": { "file_path": "src/auth/AuthSession.ts", "old_string": "...", "new_string": "..." } }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        { "type": "tool_result", "tool_use_id": "toolu_81", "content": "File edited." }
      ]
    }
  },
  {
    "type": "user",
    "message": { "content": "等下 refresh 里要考虑多设备场景\n[id:m9n8o7]" }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "好，那 refresh 不能作废其它 device 的 session id。让我把设计稍微调一下……" }
      ]
    }
  }
]


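Step 2: Autocompact triggers and takes the Session Memory Compact path

The algorithm does three things:

Set the dividing point: with lastSummarizedMessageId = u128, u129 and everything after it belong to the retained region
Align API invariants: the first retained message is an assistant message with a tool_use, and its paired tool_result is also in the retained region, so nothing needs to be pulled in from earlier
Build the summary message: wrap the body of summary.md in a single user message

Step 3: the messages array after compaction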

JSONC

[
  // ---- ① compact boundary (carries preservedSegment metadata, used to relink on resume) ----
  {
    "type": "user",
    "isMeta": true,
    "message": { "content": "<compact boundary>" },
    "compactMetadata": {
      "preservedSegment": {
        "headUuid": "u001",
        "anchorUuid": "u128",
        "tailUuid": "u134"
      }
    }
  },

  // ---- ② summary message (from summary.md, injected with the user role) ----
  {
    "type": "user",
    "isMeta": true,
    "message": {
      "content": "Below is a summary of the session so far:\n\n# Session Title\nRefactor auth middleware for compliance rewrite\n\n# Current State\nMigrating session token storage from cookies to encrypted Redis.\n...\n# Worklog\n- Drafted AuthSession type and Redis gateway\n- Updated login/logout to issue/revoke AuthSession\n- Updated middleware to validate AuthSession on each request\n- Started refresh() branch; paused to handle multi-device concern"
    }
  },

  // ---- ③ recent messages: kept verbatim (u129 ~ u134, not a character changed) ----
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "AuthSession.refresh 的主要分支我画出来了，先写核心路径。" },
        { "type": "tool_use", "id": "toolu_80", "name": "Read", "input": { "file_path": "src/auth/AuthSession.ts" } }
      ]
    }
  },
  { "type": "user", "message": { "content": [/* toolu_80 tool_result 原文 */] } },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "refresh 要更新 Redis TTL 并回写新 opaque id。写一版。" },
        { "type": "tool_use", "id": "toolu_81", "name": "Edit", "input": {/* ... */} }
      ]
    }
  },
  { "type": "user", "message": { "content": [/* toolu_81 tool_result 原文 */] } },
  { "type": "user", "message": { "content": "等下 refresh 里要考虑多设备场景\n[id:m9n8o7]" } },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "好，那 refresh 不能作废其它 device 的 session id。让我把设计稍微调一下……" }
      ]
    }
  },

  // ---- ④ post-compact attachments (injected in order, each subject to its trigger condition) ----

  // ④-1 file_reference: up to 5 recently read files (50K token budget, 5K per file)
  {
    "type": "user",
    "isMeta": true,
    "message": {
      "content": [
        { "type": "text", "text": "<attachment: src/auth/AuthSession.ts>\n<full content, ≤5K tokens>" },
        { "type": "text", "text": "<attachment: src/auth/login.ts>\n<full content>" },
        { "type": "text", "text": "<attachment: src/redis/sessionStore.ts>\n<full content>" }
      ]
    }
  },

  // ④-2 plan_file_reference: the active plan file
  {
    "type": "user",
    "isMeta": true,
    "message": { "content": "<plan: implement AuthSession.refresh with multi-device support>" }
  },

  // ④-3 invoked_skills: skills activated in this session (≤5K per skill, ≤25K in total)
  {
    "type": "user",
    "isMeta": true,
    "message": { "content": "<invoked skills: test-driven-development, systematic-debugging>" }
  },

  // ④-4 plan_mode: if the session is in plan mode, attach a status notice
  // (not in plan mode in this example, skipped)

  // ④-5 task_status: if a background async agent is running
  // (no background task in this example, skipped)

  // ④-6 deferred_tools_delta / ④-7 agent_listing_delta / ④-8 mcp_instructions_delta
  // These three deltas are injected only when the tool list / agent list / MCP instructions changed since before the compact
  // (assumed unchanged in this example, skipped)

  // ---- ⑤ session start hooks ----
  {
    "type": "user",
    "isMeta": true,
    "message": { "content": "<session start hook output>" }
  }
]


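A few things worth noting

"Recent messages kept verbatim" is the most fundamental difference between SM-compact and traditional LLM compact. The model still gets what you actually said, the tools that actually ran, and the results that actually came back, not an LLM paraphrase.
The summary comes from a file, not from an ad hoc model call. Because summary.md is maintained continuously in the background, the client does not need to issue another API call at the moment Autocompact triggers; it reads the disk. That is why it is the first try: faster and cheaper than the LLM Compact fallback.
API invariants do not break. Every tool_result in the retained region is guaranteed a matching tool_use. The algorithm checks and, when necessary, pulls in the assistant message holding the tool_use, even if it originally sat before lastSummarizedMessageId.
preservedSegment is the hook for relinking. The headUuid / anchorUuid / tailUuid on the compact boundary record where the preserved segment starts and ends. On resume, these three UUIDs reconnect the post-compact view to the original transcript.
Post-compact attachments are not part of the summary; they are the "inventory". The summary tells you "we are changing AuthSession", and the attachments make sure the latest source of AuthSession.ts is already in context without another Read. The budget is fixed: 50K tokens, 5K per file, at most 5 files.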

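Sub-path B: traditional LLM Compact (9-section summary)

What it is

The fallback path of Autocompact: when Session Memory Compact has nothing usable, the client makes a one-off model call that generates a 9-section summary of the conversation.

When it triggers

When trySessionMemoryCompaction() returns null.

What happens once it triggers

At its core this is a "conversation about the conversation": the client builds a new API request out of the current messages: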

PYTHON

# Pseudocode
summary_prompt = build_compact_prompt()   # full text in the appendix at the end of this section
summary_request = { "role": "user", "content": summary_prompt }

api_messages = normalize(strip_images(strip_attachments([
  *get_messages_after_compact_boundary(messages),   # current history (excluding anything already compacted)
  summary_request                                    # the "please summarize" user message appended at the end
])))

summary_text = call_api(
  messages = api_messages,
  system   = "You are a helpful AI assistant tasked with summarizing conversations.",
  thinking = DISABLED,
  source   = "compact"
)


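The api_messages array actually sent to the summarizing model looks roughly like this. The large middle part is the full original history of the current session; the bracketed comment marks what is omitted: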

JSONC

[
  // ========== session history (after stripping images and reinjected attachments) ==========

  {
    "type": "user",
    "message": { "content": "帮我重构 auth 中间件，legal 那边给了新要求\n[id:a1b2c3]" }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "好的，先读一下 middleware.ts。" },
        { "type": "tool_use", "id": "toolu_01", "name": "Read", "input": { "file_path": "src/auth/middleware.ts" } }
      ]
    }
  },
  {
    "type": "user",
    "message": { "content": [{ "type": "tool_result", "tool_use_id": "toolu_01", "content": "<middleware.ts 原文>" }] }
  },

  // [...N real history messages omitted: further user input / assistant tool_use / tool_result / thinking, etc.
  //    In a real session this stretch is typically 50~200 messages and 80K~150K tokens, which is exactly why compact triggers...]

  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "refresh 的多设备分支写到一半，现在卡在是否需要全局广播 session 变更。" }
      ]
    }
  },

  // ========== the summary request appended at the end (the "instruction" for compact) ==========

  {
    "type": "user",
    "message": {
      "content": "CRITICAL: Respond with TEXT ONLY. Do NOT call any tools.\n\n- Do NOT use Read, Bash, Grep, Glob, Edit, Write, or ANY other tool.\n- You already have all the context you need in the conversation above.\n- Tool calls will be REJECTED and will waste your only turn — you will fail the task.\n- Your entire response must be plain text: an <analysis> block followed by a <summary> block.\n\nYour task is to create a detailed summary of the conversation so far ...\n\n[完整原文见本节末"附录：完整 compact prompt 原文"]"
    }
  }
]


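The call parameters carry three more key settings:

system is a fixed one-liner: "You are a helpful AI assistant tasked with summarizing conversations."
thinkingConfig is explicitly disabled: a summarization task does not need extended thinking
querySource = "compact": marks this as a summarization call, so it does not re-trigger compact / snip or other context management (avoiding recursion)

The summary must have exactly 9 sections (the full prompt is in the appendix at the end of this section):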

TEXT

1. Primary Request and Intent
2. Key Technical Concepts
3. Files and Code Sections
4. Errors and fixes
5. Problem Solving
6. All user messages
7. Pending Tasks
8. Current Work
9. Optional Next Step


What if the compact itself hits prompt-too-long

Compact will not let the session get stuck for good. It allows at most 3 PTL (prompt-too-long) retries, following this flow:

flowchart TD
    A[Build api_messages<br/>call the summary API] --> B{API response?}
    B -->|success| C[Extract summary<br/>→ buildPostCompactMessages]
    B -->|PromptTooLong| D{attempt &lt; MAX_PTL_RETRIES=3?}
    D -->|no| Z[Give up: compact failed<br/>context left unchanged]
    D -->|yes| E{error.tokenGap<br/>parseable?}
    E -->|yes| F[Trim precisely by tokenGap:<br/>drop oldest rounds<br/>until ≥ tokenGap]
    E -->|no| G[Fallback:<br/>drop oldest 20% rounds]
    F --> H{First message after trimming<br/>is assistant?}
    G --> H
    H -->|yes| I["Insert synthetic user marker:<br/>'[earlier conversation truncated<br/>for compaction retry]'"]
    H -->|no| J[Leave unchanged]
    I --> K[attempt += 1]
    J --> K
    K --> A


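The corresponding pseudocode:

PYTHON

# Pseudocode
MAX_PTL_RETRIES = 3

def compact_with_ptl_retry(messages):
    attempt = 0
    while True:
        try:
            return call_compact(messages)
        except PromptTooLong as e:
            if attempt >= MAX_PTL_RETRIES:
                return None  # give up: compact failed, context stays as it was

            if e.token_gap:
                # Trim precisely, using the tokenGap the API returned
                messages = drop_oldest_groups_until_gap_covered(messages, e.token_gap)
            else:
                # Fallback: drop the oldest 20% of rounds
                messages = drop_oldest_20_percent(messages)

            # If the first message is now an assistant message, the API rejects it
            # (the first message must be a user message), so prepend a synthetic meta marker
            if first_is_assistant(messages):
                messages.insert(0, {
                    "type": "user",
                    "isMeta": True,
                    "message": { "content": "[earlier conversation truncated for compaction retry]" }
                })

            attempt += 1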


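A few design choices worth noting:

Drop by round, not by single message. A round roughly corresponds to "user input → the assistant's series of tool_use/thinking → the final text reply". Dropping whole rounds keeps tool_use and tool_result from being cut apart, so no new API invariant violations are created.
tokenGap-based dropping takes priority. When the API tells you explicitly that you are over by X tokens, dropping rounds worth X tokens is enough; the 20% fallback applies only when no tokenGap is available.
The synthetic meta marker is a structural tax. The line [earlier conversation truncated for compaction retry] is not for the user, nor really for the model to read. It exists purely to satisfy the API constraint that the first message must be a user message.
Beyond 3 retries, it gives up. If trimming three times still yields PTL, the session most likely contains an oversized single message (for example a 100K-token tool_result that Tool Result Budget did not catch). Compact cannot recover from that and aborts.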
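
Round-based dropping can be sketched as follows. The grouping rule (a round starts at a user message whose content is plain text rather than tool_result blocks), the choice to always keep the newest round, and both function names are assumptions made for illustration; messages are flattened dicts with `content` directly on the message.

```python
def group_into_rounds(messages):
    """Split messages into rounds; a plain-text user message opens a new round."""
    rounds = []
    for m in messages:
        starts_round = m["type"] == "user" and isinstance(m["content"], str)
        if starts_round or not rounds:
            rounds.append([])
        rounds[-1].append(m)
    return rounds


def drop_oldest_rounds(messages, token_gap, count_tokens):
    """Drop whole rounds from the front; the newest round is always kept."""
    rounds = group_into_rounds(messages)
    if not rounds:
        return []
    if token_gap is not None:
        freed, n = 0, 0
        while n < len(rounds) - 1 and freed < token_gap:
            freed += sum(count_tokens(m) for m in rounds[n])
            n += 1
    else:
        # Fallback: oldest 20% of rounds, at least one when possible.
        n = min(max(1, len(rounds) // 5), len(rounds) - 1)
    return [m for r in rounds[n:] for m in r]


history = []
for i in range(5):
    history.append({"type": "user", "content": f"q{i}"})
    history.append({"type": "assistant", "content": [{"type": "text", "text": f"a{i}"}]})

flat_cost = lambda m: 100  # stand-in token counter for the demo
print(len(drop_oldest_rounds(history, 250, flat_cost)))   # 6
print(len(drop_oldest_rounds(history, None, flat_cost)))  # 8
```

With a gap of 250 tokens and 200 tokens per round, one round is not enough, so two are dropped. Because every round opens with a user message, the result still starts with a user message whenever the original history did.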

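The main thread's messages after compaction

Once the summarizing model returns, the client rebuilds the main thread's messages with buildPostCompactMessages():

JSONC

[
  "<compact boundary>",
  "<compact summary user message (9-section body)>",
  "<post-compact attachments: the 5 most recent files, plan, invoked skills, and the rest of the 8 types>",
  "<session start hooks>"
]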


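The post-compact restore budgets are hard-coded constants (defaults):

Restore at most 5 files
All attachments combined: 50K tokens
Per file: 5K tokens
Per skill: 5K tokens
Skills combined: 25K tokens

Scenario

Session Memory Compact returns null and the client falls back to traditional LLM Compact:

The main thread is suspended first
A one-off model call produces the 9-section structured summary
The main thread's messages are rebuilt from the summary + the few key files read most recently + the current plan
The session continues, but the original text of recent messages is not kept. That is the biggest difference from Session Memory Compact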

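Appendix: full text of the compact prompt

For easy cross-checking, here is the full user message that getCompactPrompt() assembles and sends to the summarizing model. It is a concatenation of 4 parts:

TEXT

NO_TOOLS_PREAMBLE
+ BASE_COMPACT_PROMPT          ← embeds DETAILED_ANALYSIS_INSTRUCTION_BASE
+ [optional] customInstructions    ← if extra instructions were configured via CLAUDE.md / compact instructions
+ NO_TOOLS_TRAILER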
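
The four-part assembly can be sketched like this. The constant values are shortened stand-ins for the real text, and the blank-line separator and the rule for skipping empty custom instructions are assumptions; only the ordering comes from the list above.

```python
from typing import Optional

# Shortened stand-ins; the real strings are much longer.
NO_TOOLS_PREAMBLE = "CRITICAL: Respond with TEXT ONLY. Do NOT call any tools."
BASE_COMPACT_PROMPT = "Your task is to create a detailed summary of the conversation so far."
NO_TOOLS_TRAILER = "REMINDER: Do NOT call any tools. Respond with plain text only."


def get_compact_prompt(custom_instructions: Optional[str] = None) -> str:
    """Concatenate preamble, base prompt, optional custom text, and trailer."""
    parts = [NO_TOOLS_PREAMBLE, BASE_COMPACT_PROMPT]
    if custom_instructions and custom_instructions.strip():
        parts.append(custom_instructions.strip())
    parts.append(NO_TOOLS_TRAILER)
    return "\n\n".join(parts)  # assumption: parts are separated by blank lines


prompt = get_compact_prompt("Focus on typescript code changes.")
print(prompt.split("\n\n")[2])  # Focus on typescript code changes.
```

Keeping the trailer last means the no-tools reminder is always the final thing the model reads, whatever the user configured.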


Full text (with no customInstructions configured):

TEXT

CRITICAL: Respond with TEXT ONLY. Do NOT call any tools.

- Do NOT use Read, Bash, Grep, Glob, Edit, Write, or ANY other tool.
- You already have all the context you need in the conversation above.
- Tool calls will be REJECTED and will waste your only turn — you will fail the task.
- Your entire response must be plain text: an <analysis> block followed by a <summary> block.

Your task is to create a detailed summary of the conversation so far, paying close attention to the user's explicit requests and your previous actions.
This summary should be thorough in capturing technical details, code patterns, and architectural decisions that would be essential for continuing development work without losing context.

Before providing your final summary, wrap your analysis in <analysis> tags to organize your thoughts and ensure you've covered all necessary points. In your analysis process:

1. Chronologically analyze each message and section of the conversation. For each section thoroughly identify:
   - The user's explicit requests and intents
   - Your approach to addressing the user's requests
   - Key decisions, technical concepts and code patterns
   - Specific details like:
     - file names
     - full code snippets
     - function signatures
     - file edits
   - Errors that you ran into and how you fixed them
   - Pay special attention to specific user feedback that you received, especially if the user told you to do something differently.
2. Double-check for technical accuracy and completeness, addressing each required element thoroughly.

Your summary should include the following sections:

1. Primary Request and Intent: Capture all of the user's explicit requests and intents in detail
2. Key Technical Concepts: List all important technical concepts, technologies, and frameworks discussed.
3. Files and Code Sections: Enumerate specific files and code sections examined, modified, or created. Pay special attention to the most recent messages and include full code snippets where applicable and include a summary of why this file read or edit is important.
4. Errors and fixes: List all errors that you ran into, and how you fixed them. Pay special attention to specific user feedback that you received, especially if the user told you to do something differently.
5. Problem Solving: Document problems solved and any ongoing troubleshooting efforts.
6. All user messages: List ALL user messages that are not tool results. These are critical for understanding the users' feedback and changing intent.
7. Pending Tasks: Outline any pending tasks that you have explicitly been asked to work on.
8. Current Work: Describe in detail precisely what was being worked on immediately before this summary request, paying special attention to the most recent messages from both user and assistant. Include file names and code snippets where applicable.
9. Optional Next Step: List the next step that you will take that is related to the most recent work you were doing. IMPORTANT: ensure that this step is DIRECTLY in line with the user's most recent explicit requests, and the task you were working on immediately before this summary request. If your last task was concluded, then only list next steps if they are explicitly in line with the users request. Do not start on tangential requests or really old requests that were already completed without confirming with the user first.
                       If there is a next step, include direct quotes from the most recent conversation showing exactly what task you were working on and where you left off. This should be verbatim to ensure there's no drift in task interpretation.

Here's an example of how your output should be structured:

<example>
<analysis>
[Your thought process, ensuring all points are covered thoroughly and accurately]
</analysis>

<summary>
1. Primary Request and Intent:
   [Detailed description]

2. Key Technical Concepts:
   - [Concept 1]
   - [Concept 2]
   - [...]

3. Files and Code Sections:
   - [File Name 1]
      - [Summary of why this file is important]
      - [Summary of the changes made to this file, if any]
      - [Important Code Snippet]
   - [File Name 2]
      - [Important Code Snippet]
   - [...]

4. Errors and fixes:
    - [Detailed description of error 1]:
      - [How you fixed the error]
      - [User feedback on the error if any]
    - [...]

5. Problem Solving:
   [Description of solved problems and ongoing troubleshooting]

6. All user messages:
    - [Detailed non tool use user message]
    - [...]

7. Pending Tasks:
   - [Task 1]
   - [Task 2]
   - [...]

8. Current Work:
   [Precise description of current work]

9. Optional Next Step:
   [Optional Next step to take]

</summary>
</example>

Please provide your summary based on the conversation so far, following this structure and ensuring precision and thoroughness in your response.

There may be additional summarization instructions provided in the included context. If so, remember to follow these instructions when creating the above summary. Examples of instructions include:
<example>
## Compact Instructions
When summarizing the conversation focus on typescript code changes and also remember the mistakes you made and how you fixed them.
</example>

<example>
# Summary instructions
When you are using compact - please focus on test output and code changes. Include file reads verbatim.
</example>

REMINDER: Do NOT call any tools. Respond with plain text only — an <analysis> block followed by a <summary> block. Tool calls will be rejected and you will fail the task.


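A few prompt engineering details in this prompt are worth a look:

Tool calls are forbidden outright three times: the opening CRITICAL line, the "Tool calls will be REJECTED" bullet right under it, and the closing REMINDER. For a model that is very good at tool calling, this kind of repeated hard constraint is necessary. Say it only once and the model may still be tempted to Read something to double-check
A forced two-part output, <analysis> → <summary>: the former lets the model think it through first, the latter is the summary that actually gets written into the transcript. Separating them keeps the thinking from contaminating the summary body
Section 6, "All user messages", is a defense against distortion: it forces a listing of every user message that is not a tool_result, so the model cannot keep only what it feels like remembering
Section 9 demands "direct quotes": the next step must carry verbatim excerpts, which prevents task drift after the summary. What compact fears most is the model quietly changing the user's intent while summarizing
customInstructions is a slot near the end: users can add instructions to this prompt through CLAUDE.md or dedicated compact instructions, such as "focus on typescript code changes" / "include test output verbatim". Per the assembly order above, they land after the base prompt and before NO_TOOLS_TRAILER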
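
Separating the two output blocks on the client side could look like the sketch below. How the real client parses the response, and what it does when the tags are missing, is not described here; the regex approach and the `None` return are assumptions, and `extract_summary` is a hypothetical name.

```python
import re
from typing import Optional


def extract_summary(response_text: str) -> Optional[str]:
    """Return the text inside <summary>...</summary>, dropping <analysis>."""
    match = re.search(r"<summary>(.*?)</summary>", response_text, re.DOTALL)
    if match is None:
        return None  # assumption: the caller decides how to handle this
    return match.group(1).strip()


response = (
    "<analysis>\nWalked through the conversation in order.\n</analysis>\n\n"
    "<summary>\n1. Primary Request and Intent:\n   Refactor auth middleware\n</summary>"
)
print(extract_summary(response))
```

Only the summary body survives into the rebuilt messages, so the model's scratch reasoning in the analysis block never becomes part of the session's long-term context.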

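Business takeaways from the layered design

Putting the five levels plus the two sub-paths into one table:

Level	Trigger	Acts on	Cost	Calls the model?
Tool Result Budget	On tool return & before each request	A single tool_result	Very low	No
Snip	Every request / initiated by the model	Whole messages	Low	No (model-driven)
Microcompact (time)	After 60 min of inactivity	Old tool_result.content	Low	No
Microcompact (cached)	Every request (when cache is supported)	The server-side cache view	Very low	No
Context Collapse	Every request	Segmented archiving + summary placeholder	Medium	Yes (summary generation)
Session Memory Compact	Autocompact first choice	Early history → summary.md	Medium (disk read)	A background agent maintains the file
Traditional LLM Compact	Autocompact fallback	Full history → 9-section summary	High (main-thread-level LLM call)	Yes

A few design choices worth noting:

"Do the cheap things first." Tool Result Budget is just a character-count comparison plus a file write, close to zero cost; LLM Compact is a main-thread-level API call, which is heavy work. The pipeline puts cheap, fine-grained handling up front and expensive, coarse handling at the back: a typical cost-aware pipeline.
"Don't call the model if you can avoid it." At the moment of compaction, only the fallback path of the last level, Autocompact, spends a dedicated main-thread API call on summarization. Per the table, the exceptions are Context Collapse, which generates summaries for the spans it folds, and the background agent that maintains summary.md. The earlier levels are mechanical replacement, deletion by UUID, or server-side cache directives.
"Keep the recent original text" is a clear value ranking. All the complexity of Session Memory Compact (continuous background maintenance of summary.md, lastSummarizedMessageId bookkeeping, API invariant repair) protects one goal: recent messages stay verbatim. What a developer is doing right now usually needs the original details, while the background context only needs a knowledge-style summary.
Every level is a modification of the messages array. For the Anthropic API there is no mysterious compression parameter; every mechanism lands on the array in the payload. The only exceptions are the cache_edits of Cached Microcompact and API-native Context Management (clear_thinking_* / clear_tool_uses_*), which are server-side conventions.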

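Closing thoughts

If you are building context management for an AI application, this pipeline offers at least three things you can borrow directly:

Layers instead of a single point. Don't rely on one big hammer that compresses once a threshold is hit. Bloat of different sizes and kinds calls for treatments of different cost.
Keeping the original text beats summarizing. The model almost always does better reading the original than reading its own summary, so keep what you can.
The compaction mechanism needs its own rescue plan. What happens if the API request your compact logic makes hits prompt-too-long itself? Claude Code builds in 3 PTL retries plus a synthetic marker, which is worth borrowing.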
