Opening: What Are We Actually Talking About
If you've built any application based on the Claude API / OpenAI API, you've probably stumbled into the same pitfall: when context grows to a certain length, either you get rejected with prompt-too-long, or your wallet gets drained by cache misses, or the model starts to "lose its memory."
People usually write a simple compact layer themselves: drop a segment of history when hitting a threshold, or insert a summary. Claude Code is different—it's not one mechanism, but a five-level pipeline. Each level handles different scenarios, different costs, and different fallback timings, and each ultimately lands on that messages array sent to the API.
This article completely disassembles this pipeline, telling you for each level:
- What it is
- When it triggers
- What happens after triggering
- How the messages array changes before and after
- A real-world scenario
First, Let's Align on a Consensus: The Messages Array Is the Only Truth
For the Anthropic API, all "context management" boils down to the client deciding one thing—
Which set of
messagesam I going to send this turn.
The API doesn't know "your conversation history," nor does it know "how many times you've compacted." It only knows that messages array in your request payload this time. Each message is either user or assistant, and content might be a string, or a mix of blocks like text, tool_use, tool_result, or thinking.
Claude Code's context management, put simply, is modifying this array layer by layer on the client side, then sending the modified result to the API. In rare cases, it also attaches some server-side Context Management directives (like cache_edits / clear_tool_uses_20250919), but that's the icing on the cake, not the main trunk.
Remember this: Every level discussed below is reconstructing the messages array.
Overview: The Five-Level Pipeline
Before each request to the API, Claude Code runs in this order:
The design intuition is:
- The mechanisms at the front are cheapest and most fine-grained, only intercepting or replacing small chunks at the entrance
- The further back you go, the higher the cost and coarser the action, with the final step only calling a model to generate a summary when necessary
- Each level takes over only when the previous level didn't solve the problem, avoiding "doing heavy work right away"
Let's break them down one by one.
Level 1: Tool Result Budget
What It Is
This is entrance throttling, not compression of history.
When a tool (like Bash, Read, Grep) finishes executing and its tool_result is about to be stuffed into the messages array, it passes through a budget check first: if it's too large, the original text isn't allowed in.
When It Triggers
Two time points:
- Per-result moment: Right after each tool executes, when the result is being packaged into a
tool_resultblock - Aggregation moment: Before each API request, a total budget check is performed on all
tool_results in the entire user message (mainly to prevent stacking from parallel tools returning in the same turn)
What Happens After Triggering
Two-level thresholds (default values):
- Single result 50K characters: If exceeded, persist to disk at
tool-results/<tool_use_id>.txt, leaving only a "reference message" in messages - Single message total 200K characters for all tool_results: If exceeded, pick the largest one and apply the same replacement until back within budget
The replaced content follows a fixed template:
Before/After Comparison of the Messages Array
Scenario
You have Claude run cat huge.log, stdout is 500KB. If this 500KB original text went directly into context:
- This round's API call would eat up most of the window
- Every subsequent round would resend it (no client cache can solve this)
- The model can't actually read such a long log anyway
What Tool Result Budget does: only let a 2KB preview enter context, the model sees "the file is there," and if it really needs details later, it can precisely re-read the corresponding segment via Read(offset, limit). A lightweight disk layer replaces a heavy token layer.
Level 2: Snip (Precision Trimming)
What It Is
An empowerment mechanism for active deletion: attach a short ID to each user input, letting the model reference this ID to say "I don't want this whole turn anymore," then physically remove everything from that user input to the next user input (including the user message itself, subsequent assistant thinking, all tooluse / toolresult) from the messages array.
This is the only model-driven level in the entire pipeline—other levels are automatic decisions made by the client.
Deletion Unit: An Entire User Turn
The most important thing to understand about Snip is that it works by user turn as the unit, not by single message.
A user turn looks like this:
When the model calls SnipTool on a certain ID, everything from that user input start to before the next user input disappears from subsequent API requests.
When It Triggers
Three time points:
- Before each API request: On the copy of messages about to be sent, append
[id:<shortID>]tail tags to all non-meta user messages (tool_result class user messages don't count as "real user input," so they don't get IDs) - Context grows ~10K tokens without snipping: Inject a nudge attachment reminding the model "you can snip"
- Model actively calls SnipTool: The actual deletion executes
What Happens After Triggering
- Short IDs are derived from message UUIDs (first 10 hex chars converted to base36 first 6 chars), each user input has a stable, short, model-easy-to-recite ID
- Model calls SnipTool passing one or more IDs, client collects all UUIDs for the corresponding user turn, deletes them from in-memory array
removedUuidswritten to transcript boundary; replayed on resume for persistence- After deletion, backtrack and fix affected
parentUuids to avoid dangling chains
Key detail: [id:xxxxxx] tags are only added to the "copy sent to API", not written back to original storage. The user's original words in the transcript remain clean; only the model-visible copy carries the tag.
Complete Example: From Multi-turn Conversation to a Single Snip
Assume the session has accumulated the following history (omitting some fields for readability). Note that Claude Code appends [id:...] tags at the end of every real user input before sending to API—tool_result type user messages don't get tags.
Step 0: Original Messages Array (Version Sent to API)
At this point, the model can see two real user input IDs in this turn: abc123 (TODO research) and def456 (login bug fix). Since the user explicitly said "forget about the TODOs," all content about Turn 1 (the list of 23 TODOs, full login.ts text, and corresponding thinking) is now pure token burden for the subsequent bug fix work.
Step 1: Model Actively Calls SnipTool
Step 2: Messages Array Sent to API in Next Turn After Snip Execution
The entire Turn 1 segment (user input + two assistant tooluses + two toolresults, 5 messages total) is physically deleted:
Notable Points
- Deletion is paired.
toolu_01'stool_useand its correspondingtool_resultdisappear together, same fortoolu_02. This prevents API-side errors like "toolresult found but no tooluse" or vice versa. [id:def456]is untouched. Snip precisely targets the user turn where abc123 lives, without harming subsequent turns.- Transcript on-disk version remains unchanged. If you resume this session, Claude Code replays the same deletion based on
removedUuids, keeping the model-visible view consistent—but the user's "original words" themselves always remain on disk for auditability. - Significant context savings. Turn 1's Grep results + login.ts full text is roughly 6~8K tokens, saved in one Snip, and this is a precision preservation of original text, not a summary.
Why This Design
The weakness of traditional compact is "one-size-fits-all summarization"—coarse granularity, easily losing useful original text along with the chaff. Snip is the opposite: the model itself judges which turn is obsolete, precisely excising the entire turn, while remaining recent messages stay original. They complement each other:
- When you just pivot direction once and want to discard a detour, use Snip
- When the entire context is overloaded and there's no obvious "this turn is obsolete" boundary, use Microcompact / Autocompact later on
Level 3: Microcompact (Lightweight Rewrite)
What It Is
Lightweight compression targeting only old tool results. It doesn't summarize conversation, doesn't call models, doesn't modify user messages—it only does one thing: replace old large tool_result.content with placeholders or cache edit instructions.
It only processes results from these tools: Read, Bash, Grep, Glob, WebSearch, WebFetch, Edit, Write. User text, model thinking, plan, attachments—it touches none of them.
When It Triggers
Two independent paths:
Path A: Time-based Microcompact
- Disabled by default, when enabled:
- More than 60 minutes since last assistant message +
- On main thread +
- Check once before each request
Path B: Cached Microcompact
- Feature flag enabled +
- Model supports cache editing +
- On main thread +
- Check once before each request
What Happens After Triggering
Time-based path—directly modifies local messages:
- Find all "compressible tool" tool_results by tool id
- Keep the most recent 5, replace remaining
contentoriginal text with literal string[Old tool result content cleared] - Also reset cached microcompact module state (to avoid cache references to invalidated tool ids)
Cached path—local messages unchanged, instead adds cache_edits at API layer:
- Local array shows those old tool_results looking untouched
- But when sending to Anthropic, the payload includes an extra
cache_editsdirective telling the server "you can delete segment xxx from your cache" - Benefit is prompt cache prefix is preserved as much as possible, avoiding the "one move and all cache misses" of time-based approach
Additionally there's a layer of API-native Context Management, not done by client but natively supported by Anthropic API:
These blocks are added to API parameters, and server-side automatically cleans tool_use content when exceeding 180K input tokens.
Complete Example: Time-based Path
To keep the example readable, here's a scaled-down scenario—assuming keepRecent = 2 (default is 5). Scenario: You have Claude research a project for you, ran 3 tools in succession, then went to lunch, came back 70 minutes later to continue asking questions.
Step 0: Messages Array Before Leaving
Step 1: User Returns After 70 Minutes
At this moment, Time-based Microcompact trigger conditions are met: main thread + has previous assistant + gap > 60 minutes.
Step 2: Microcompact Scans the Messages Array
Finds all "compressible tool" tool_results from old to new:
Keeps the most recent keepRecent = 2 (toolu_02 / toolu_03), replaces remaining content with placeholder.
Step 3: Transformed Messages Array
Notable Points
tool_use_idis not deleted—toolu_01's tooluse (with parameterspattern: "**/*.ts") is fully preserved, only the corresponding toolresultcontentis replaced with a literal string. The API-sidetool_use↔tool_resultpairing remains valid.- Model can still judge "which tools were run". It can see
toolu_01was aGlob("**/*.ts")call, just the specific return is obsolete. If it really needs it later, it can call Glob again to re-fetch. - User text, assistant thinking untouched—Microcompact only targets
contentof tool_results. - Volume gains. In this example toolu01's Glob output is roughly 3KB, a placeholder is 31 bytes. In real sessions you likely have 10+ old toolresults, each several KB to tens of KB, saving a solid 5K~50K tokens.
Differences in the Cached Path
The key to the Cached path is local messages completely unchanged, replacement happens on the server-side cache. Comparison below (using the above scenario but taking the Cached path):
The benefit is: prompt cache prefix won't be interrupted. The time-based approach of directly modifying local messages changes the cache key, causing all cache hits to drop to zero next request. The Cached path lets the server "delete internally," releasing token cost while preserving cache hit rate.
Essentially a Hot/Cold Cache Distinction
Stepping back, the division of labor between the two paths is actually a cache state machine:
- Hot cache (cache still valuable) → Take Cached path, API layer
cache_editsfor fine-grained editing - Cold cache (60 minutes inactive, cache likely expired) → Take Time-based path, abandon cache hits for context space
And in the source code these two paths are short-circuit relationship—Time-based checks first, once hit it returns directly, no longer going through Cached. This order is also the natural hot/cold inference: since you've already judged "cache is cold," doing cache editing is meaningless.
Scenario Summary
- Active use: Cached path silently trims server-side cache in background, local experience is 0 change, but requests are cheaper
- Returning after long pause: Time-based path directly clears old tool_results locally, trading cache hits for context space
- Approaching token limit: API-native
clear_tool_uses_20250919as safety net, server automatically cleans at 180K threshold
Level 4: Context Collapse
This level exists in the query pipeline, positioned after Microcompact and before Autocompact. Enabling it suppresses active Autocompact—in Claude Code's design, Collapse and Autocompact compete for the same headroom, so when Collapse is on shouldAutoCompact() directly returns false, letting Collapse take over.
From transcripts you can see Collapse drops two types of records:
marble-origami-commit: Append-only splice instruction, recording "how to fold a segment of history into a summary placeholder," containingcollapseId/summaryUuid/summaryContent/firstArchivedUuid/lastArchivedUuidmarble-origami-snapshot: Last-wins staged state snapshot, containingstagedspans /armedflag /lastSpawnTokens
These record structures imply Collapse is doing "segmented archiving + summary placeholder"—roughly working by scoring early history, selecting, and packaging into an archived unit with summary, letting that history be replaced by a placeholder in subsequent requests. Finer details like staged span selection algorithms, summary placeholder specific formats, trigger threshold chains are not expanded in this article.
Level 5: Autocompact (Heavy Fallback)
What It Is
The last line of defense when the previous four levels couldn't compress context enough. It doesn't do compression itself, but chooses one of two sub-paths:
- Preferred: Session Memory Compact (reads a continuously maintained
summary.md) - Fallback: Traditional LLM Compact (temporarily calls a model once for 9-section summary)
When It Triggers
When shouldAutoCompact() judges "context approaching token limit." Note that if Context Collapse is enabled, this step is skipped directly (letting Collapse handle it).
Post-trigger flow:
Sub-path A: Session Memory Compact (Structured Summary)
What It Is
Core idea: Don't wait until context explodes to start summarizing, continuously maintain a structured summary file in the background, and read this file directly when explosion happens.
The benefits are clear:
- Zero API cost when triggering compression—no need to temporarily call a model, just read disk
- More stable summary quality—updated every once in a while in background, more systematic coverage than "temporarily generating one"
- Can continuously precipitate cross-turn information—error corrections made, project cognition, can all accumulate in one file
Session Memory File Storage Location
Filename summary.md, full path:
Note this is per session, not project-level shared. The reason is straightforward—different sessions do different things, sharing one would cause cross-interference.
10-Section Template
summary.md is not free-form diary, but background agent filling in blanks according to fixed template. Initialized with an empty template, subsequent extractions only update body text. Complete template has 10 sections:
| # | Section | What this section holds (guidance) |
|---|---|---|
| 1 | Session Title | An information-dense 5-10 word session title, no filler words |
| 2 | Current State | What is actively being worked on right now? Unfinished tasks, what to do next |
| 3 | Task specification | What is the user trying to build? Any design decisions or explanatory context |
| 4 | Files and Functions | Which files matter? What do they contain, why are they relevant |
| 5 | Workflow | Which bash commands are commonly used, in what order, how to interpret output |
| 6 | Errors & Corrections | What errors were encountered, how were they fixed, what did the user correct, which paths don't work |
| 7 | Codebase and System Documentation | Important system components, how they collaborate |
| 8 | Learnings | What approaches worked, what didn't, what to avoid (don't repeat content from other sections) |
| 9 | Key results | If user explicitly requested a specific result (answer, table, document), preserve it verbatim here |
| 10 | Worklog | Step by step what was attempted, minimal summary per step |
Note: Distinguish this from the 9-section summary format of Traditional LLM Compact later—that's the summary format temporarily generated by the model for the Autocompact fallback path, with different Sections (e.g., "All user messages", "Current Work", "Optional Next Step" etc. more conversation-context-oriented items). Session Memory's 10-section template is more "project memory" oriented.
When It Triggers
Two layers: When does background update summary.md vs When does Autocompact read it.
Background update triggers (default thresholds):
- First initialization: Current messages reach 10000 tokens
- Incremental update condition:
(token growth ≥ 5000 && tool calls ≥ 3) || (token growth ≥ 5000 && last round had no tool calls) - Only runs on
querySource === 'repl_main_thread', subagent / teammate don't run
Autocompact call timing (Sub-path A first-try entry):
shouldAutoCompact()judges compression needed- Wait for any ongoing background extraction to finish
- Read
summary.md; if file doesn't exist or is still empty template, return null to yield to fallback - Otherwise execute compression
lastSummarizedMessageId Lifecycle
This is one of Session Memory Compact's core states, determining "which messages after this belong to the retention zone." Without understanding this, you can't understand the retention algorithm below.
Semantics: UUID of the last message absorbed by summary.md
That is, messages with uuid ≤ lastSummarizedMessageId have already been digested by Session Memory; new messages (uuid > lastSummarizedMessageId) are the increment to be processed next extraction.
Update timing and value
After background extraction ends, not updated unconditionally, but has a safety gate:
Why this gate?
Because lastSummarizedMessageId is used by the compact phase to calculate "retention zone start." If updated at the moment "assistant just initiated tooluse, toolresult hasn't returned yet," subsequent compact might classify tooluse as "already summarized" and toolresult as "retention zone"—the API request would error 400 with "toolresult can't find corresponding tooluse." This gate ensures updates happen at natural breakpoints in conversation.
Retention Window Algorithm Details
What calculateMessagesToKeepIndex() does in source code, written as pseudocode:
Key points:
- Minimum is AND, not OR. Must have "enough token count and enough text-block count." This avoids degenerate scenarios—like 50K pure tool_result (a bunch of large file Reads) satisfying tokens but only 2 messages with actual text, leaving the model with almost no conversational continuity. The text-block minimum ensures "at least 5 messages actually talking" remain.
- Maximum is hard stop. Once totalTokens exceeds 40K, loop breaks and stops expanding further. This is the capacity ceiling of the retention zone, not a minimum guarantee.
- Scan direction is "start moving backwards." The end (messages array tail) never moves; what moves is the start. Each
i--pulls an earlier message into the retention zone. - Don't cross compact boundary floor. If compact already happened before, retention zone forward expansion stops at the previous compact boundary.
- Finally API invariants alignment: If start happens to land on toolresult but the paired tooluse is in an earlier assistant message, that assistant is pulled forward too. Similarly handles thinking block merge requirements.
Post-Compact Attachments Panorama
Many think compact output is just "boundary + summary + recent messages" three-piece. Actually a string of attachments hangs behind, and they are the key for the model to quickly continue working after compact.
buildPostCompactMessages() assembly order is:
These 8 attachment types have different trigger conditions:
| # | Attachment Type | Injected Content | When it appears |
|---|---|---|---|
| 1 | file_reference | Recently read files, verbatim excerpt | Have recent Read files not in retention zone |
| 2 | plan_file_reference | Current session's plan file | Have active plan |
| 3 | invoked_skills | Skills activated this session | Activated any skill |
| 4 | plan_mode | Plan mode status hint | Currently in plan mode |
| 5 | task_status | Background running agent / task status | Have background async agent running |
| 6 | deferred_tools_delta | Tool list changes vs pre-compact | Tool list changed |
| 7 | agent_listing_delta | Agent list changes | Agent list changed |
| 8 | mcp_instructions_delta | MCP instruction changes | MCP instructions changed |
Budgets (default constants):
- File attachments: max 5, total ≤ 50K tokens, single file ≤ 5K tokens
- Skill attachments: total ≤ 25K tokens, single skill ≤ 5K tokens
In other words, Claude Code does strict budget control on "what can be in the inventory"—summary tells you "what are we doing," attachments guarantee you "raw materials to continue." The division is very clear.
Complete Example: Two-Hour Session Before/After Compression
Scenario: You and Claude discussed project auth refactoring for 2 hours, ran 40+ tool calls in between, modified a dozen files. Context grew near threshold, recent rounds are implementing AuthSession.refresh(). Background summary.md has been continuously updating.
Step 0: Current summary.md Content (On Disk)
After filling the 10-section template, it looks roughly like this (showing example fills for first few sections, real file has all 10 sections):
Note this file is fixed template + background agent filling, not free-form diary. Each section below has an italic guidance line (like "What is actively being worked on right now?"), background agent fills based on these guidances.
Step 1: Messages Array Before Compression (Simplified Illustration)
Step 2: Autocompact Triggers, Chooses Session Memory Compact Path
Algorithm does three things:
- Cut boundary: Based on
lastSummarizedMessageId = u128, u129 and after belong to "retention zone" - Adjust API invariants: Retention zone first item is assistant + tooluse, paired toolresult also in retention zone—ok, no need to prepend
- Generate summary message: Wrap
summary.mdbody into a user message
Step 3: Messages Array After Compression
Notable Points
- The bolded line: "recent messages preserved verbatim" is the essential difference between SM-compact and traditional LLM compact. The model receives your actual words, actual tools run, actual results—not an LLM paraphrase.
- Summary comes from file, not temporary model call. Because summary.md is continuously maintained in background, at the moment Autocompact triggers the client doesn't need to initiate another API call, just reads disk. This is why it's first try—faster and cheaper than the fallback LLM Compact.
- API invariants won't break. Retention zone toolresults can always find corresponding tooluses. The algorithm checks and prepends the assistant message containing the tool_use if necessary, even if its original position was before
lastSummarizedMessageId. - preservedSegment is the hook for relinking. The
headUuid / anchorUuid / tailUuidon the compact boundary record "where to where is the retention segment." Resume uses these three UUIDs to reconnect the compacted view with the original transcript. - Post-compact attachments are not part of the summary, they're the "inventory." These attachments' role is: summary tells you "we're modifying AuthSession," attachments guarantee AuthSession.ts's latest source is right there in context, no need to Read again. This budget is fixed 50K tokens, 5K per file, max 5 files.
Sub-path B: Traditional LLM Compact (9-Section Summary)
What It Is
The old path taken when Session Memory Compact fails (most common reason: summary.md hasn't reached initialization threshold before exploding). The approach is straightforward: temporarily call a model once, let it generate a 9-section structured summary of the current session, then replace original history with this summary.
When It Triggers
When trySessionMemoryCompaction() returns null.
What Happens After Triggering
The core is a "conversation within a conversation"—client constructs a new API request using current messages:
The actual api_messages array sent to the summarization model looks roughly like this—the big middle section is the complete original history of the current session, with brackets indicating omitted portions:
Three key settings in call parameters:
systemfixed sentence: "You are a helpful AI assistant tasked with summarizing conversations."thinkingConfigexplicitly disabled—summary task doesn't need extended thinkingquerySource = "compact"—marks this as a "summary call," won't trigger context management flows like compact / snip again (avoiding recursion)
Model should return <analysis>...</analysis><summary>...</summary> two sections of plain text. Client extracts <summary> part as 9-section summary body, then calls buildPostCompactMessages() to assemble new main thread messages (shares same assembly function with Session Memory Compact).
Summary fixedly requires 9 sections (full prompt see appendix at end of this section):
What If Compact Itself Gets prompt-too-long
This is an easily overlooked but critical self-rescue mechanism. That "conversation within a conversation" itself is an API call, and it can also return PTL error—especially when the session is already huge when triggering compact, sending "current history + long prompt" together easily exceeds token limits.
Compact won't let the session die completely, allowing up to 3 PTL retries, flow as follows:
Corresponding pseudocode:
Notable design choices:
- Drop by round as unit, not by single message. A round roughly corresponds to "user input → assistant's series of tooluse/thinking → final text reply." Dropping by round ensures tooluse and tool_result aren't severed, avoiding creating new API invariants violations.
- TokenGap-based dropping has higher priority. When API explicitly tells you "you exceeded by X tokens," accurately dropping rounds totaling X tokens is enough; only when tokenGap is unavailable fall back to dropping 20%.
- Synthetic meta marker is structural tax.
[earlier conversation truncated for compaction retry]is not for users to see, nor for model to "really" read—it's purely to satisfy the API constraint that "first message must be user." - Give up after 3 retries. If still PTL after cutting three times, it means there's a super large single message in the session (like a 100K token tool_result that Tool Result Budget didn't catch), and compact is powerless, so abort directly.
Main Thread Messages Structure After Compression
After summary model returns, client uses buildPostCompactMessages() to rebuild main thread messages:
The only essential difference from Session Memory Compact: No messagesToKeep segment—Traditional LLM Compact replaces all history with summary, recent messages verbatim not preserved. Others (boundary / summary / attachments / hooks order and budgets) are completely identical, because they use the same assembly function.
Post-compact recovery budget is hardcoded constants (default):
- Max 5 files recovered
- All attachments total 50K tokens
- Single file 5K tokens
- Single skill 5K tokens
- Skills total 25K tokens
Scenario
New session, discussed a very compact problem with Claude—ran a dozen large tool calls within minutes, context directly hits limit. At this point summary.md hasn't reached initialization threshold (10K tokens is the "stable enough" threshold, but your session is "dense in short time").
Session Memory Compact returns null, falls back to Traditional LLM Compact:
- Main thread pauses first
- Temporarily call model once, spit out 9-section structured summary
- Rebuild main thread messages with summary + several recently read key files + current plan
- Session continues, but recent messages verbatim not preserved—this is its biggest difference from Session Memory Compact
Appendix: Complete Compact Prompt
For easy verification, below is the full user message that getCompactPrompt() ultimately assembles and sends to the summarization model. It's拼接 by 4 segments:
Full text (without customInstructions configuration):
Several prompt engineering details worth noting in this prompt:
- Three hard prohibitions on tool calls: Opening
CRITICAL, mid-section "Tool calls will be REJECTED", and endingREMINDER. For models with strong tool-calling capabilities, this high-frequency hard constraint is necessary—say it only once, and the model will still be tempted to Read to verify - Forced two-part output
<analysis>→<summary>: Former is model "thinking first," latter is what actually gets written to transcript. Separated to prevent thinking process from directly polluting summary body - Section 6 "All user messages" is anti-distortion defense: Forces listing all non-tool_result user messages, preventing the model from only picking what it wants to remember
- Section 9 requires "direct quotes": Next step must include verbatim excerpts, preventing "task drift" after compact—what compact fears most is the model subtly changing user intent during summarization
- customInstructions as tail slot: Users can add suffixes to this prompt via
CLAUDE.mdor dedicated compact instructions, like "focus on typescript code changes" / "include test output verbatim"
Business Insights from the Layered Design
If we put the five levels + two sub-paths in one table:
| Level | Trigger Timing | Target Object | Cost | Calls Model? |
|---|---|---|---|---|
| Tool Result Budget | Tool returns & before sending request | Single tool_result | Extremely low | No |
| Snip | Per request / Model-initiated | Entire message | Low | No (Model-driven) |
| Microcompact (time) | After 60min silence | Old tool_result.content | Low | No |
| Microcompact (cached) | Per request (cache supported) | Server-side cache view | Extremely low | No |
| Context Collapse | Per request | Segmented archive + summary placeholder | Medium | Yes (Summary generation) |
| Session Memory Compact | Autocompact preferred | Early history → summary.md | Medium (Disk read) | Background agent maintains file |
| Traditional LLM Compact | Autocompact fallback | Full history → 9-section summary | High (Main thread LLM call) | Yes |
Notable design choices:
"Do cheap things first." Tool Result Budget is just character counting + file writing, almost zero cost; LLM Compact is main thread-level API call, heavy work. Pipeline puts cheap, fine-grained processing at front, expensive, coarse processing at back—typical cost-aware pipeline.
"Don't call model if possible." Until the final fallback path of Autocompact, no API call is spent on summarization. All previous levels are either mechanical replacement, UUID deletion, or server-side cache instructions.
"Preserve recent verbatim" is a clear value ordering. All complexity of Session Memory Compact—background continuously maintaining
summary.md,lastSummarizedMessageIdbookkeeping, API invariants repair—protects the same goal: recent messages preserved verbatim. Because the developer's "currently doing" often needs verbatim details, while "context铺垫" only needs knowledge-level summary.Every level is modification of the messages array. For Anthropic API there's no mysterious compression parameter, all mechanisms land on that array in the payload. Only exceptions are Cached Microcompact's
cache_editsand API-native Context Management (clear_thinking_*/clear_tool_uses_*), which are server-level conventions.
Closing Remarks
If you're building AI application context management, this pipeline gives at least three directly borrowable points:
- Layer rather than single point. Don't have just one "compress when threshold hit" big hammer. Different scales, different types of bloat suit different cost treatments.
- Preserve verbatim over summarization. Model reading verbatim almost always outperforms reading its own summary; preserve when possible.
- Compression mechanism itself needs self-rescue. What if that API request your compact logic calls itself prompt-too-longs? Claude Code specifically wrote 3 PTL retries + synthetic marker, worth borrowing.