Claude Code Context Management

Opening: What Are We Actually Talking About

If you've built any application based on the Claude API / OpenAI API, you've probably stumbled into the same pitfall: when context grows to a certain length, either you get rejected with prompt-too-long, or your wallet gets drained by cache misses, or the model starts to "lose its memory."

Most people hand-roll a simple compaction layer: drop a segment of history when a threshold is hit, or insert a summary. Claude Code is different: it is not one mechanism but a five-level pipeline. Each level covers a different scenario, carries a different cost, and kicks in at a different point, and each ultimately acts on the messages array sent to the API.

This article takes the pipeline apart level by level. For each level it covers:

What it is
When it triggers
What happens after triggering
How the messages array changes before and after
A real-world scenario

First, Let's Align on a Consensus: The Messages Array Is the Only Truth

For the Anthropic API, all "context management" boils down to the client deciding one thing—

Which set of messages am I going to send this turn.

The API doesn't know "your conversation history," nor does it know "how many times you've compacted." It only knows that messages array in your request payload this time. Each message is either user or assistant, and content might be a string, or a mix of blocks like text, tool_use, tool_result, or thinking.

Claude Code's context management, put simply, is modifying this array layer by layer on the client side, then sending the modified result to the API. In rare cases, it also attaches some server-side Context Management directives (like cache_edits / clear_tool_uses_20250919), but those are a supplement, not the main mechanism.

Remember this: Every level discussed below is reconstructing the messages array.

Overview: The Five-Level Pipeline

Before each request to the API, Claude Code runs in this order:

TEXT

Tool Result Budget → Snip → Microcompact → Context Collapse → Autocompact
                                                                  │
                                            ┌─────────────────────┤
                                            ▼                     ▼
                                 Session Memory Compact    Traditional LLM Compact
                                       (Preferred Path)         (Fallback Path)

The design intuition is:

The mechanisms at the front are cheapest and most fine-grained, only intercepting or replacing small chunks at the entrance
The further back you go, the higher the cost and coarser the action, with the final step only calling a model to generate a summary when necessary
Each level takes over only when the previous level didn't solve the problem, avoiding "doing heavy work right away"
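One way to picture this escalation is a loop that applies stages from cheapest to heaviest and stops once the array fits. This is a simplified sketch, not Claude Code's actual control flow: in the real pipeline each level has its own trigger conditions, described in the sections below. The stage functions, `estimate_tokens`, and the 4-characters-per-token ratio are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Message = dict
Stage = Tuple[str, Callable[[List[Message]], List[Message]]]


def estimate_tokens(messages: List[Message]) -> int:
    # Crude assumption for illustration: about 4 characters per token.
    return sum(len(str(m.get("content", ""))) for m in messages) // 4


def run_pipeline(messages: List[Message], stages: List[Stage], token_limit: int):
    """Apply stages cheapest-first and stop as soon as the array fits the limit."""
    applied = []
    for name, stage in stages:
        if estimate_tokens(messages) <= token_limit:
            break  # earlier levels were enough; skip the heavier ones
        messages = stage(messages)
        applied.append(name)
    return messages, applied


# Two hypothetical stages: a cheap truncation and a coarse "drop the oldest half".
def truncate_large(messages):
    return [{**m, "content": str(m["content"])[:400]} for m in messages]


def drop_oldest_half(messages):
    return messages[len(messages) // 2:]


history = [{"role": "user", "content": "x" * 4000}]
result, applied = run_pipeline(
    history,
    [("truncate_large", truncate_large), ("drop_oldest_half", drop_oldest_half)],
    token_limit=500,
)
print(applied)  # only the cheap stage runs, because it already brings the array under the limit
```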

Let's break them down one by one.

Level 1: Tool Result Budget

What It Is

This is entrance throttling, not compression of history.

When a tool (like Bash, Read, Grep) finishes executing and its tool_result is about to be stuffed into the messages array, it passes through a budget check first: if it's too large, the original text isn't allowed in.

When It Triggers

Two time points:

Per-result moment: Right after each tool executes, when the result is being packaged into a tool_result block
Aggregation moment: Before each API request, a total budget check is performed on all tool_results in the entire user message (mainly to prevent stacking from parallel tools returning in the same turn)

What Happens After Triggering

Two-level thresholds (default values):

Single result 50K characters: If exceeded, persist to disk at tool-results/<tool_use_id>.txt, leaving only a "reference message" in messages
Single message total 200K characters for all tool_results: If exceeded, pick the largest one and apply the same replacement until back within budget

The replaced content follows a fixed template:

TEXT

<persisted-output>
Output too large (317842 chars). Full output saved to: /project/.claude/tool-results/toolu_abc.txt

Preview (first 2 KB):
...first 2KB of original text...
</persisted-output>
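The two thresholds and the replacement template can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: `persist`, `apply_budget`, and the constant names are hypothetical, the preview is cut at 2,000 characters as a stand-in for "first 2 KB", and the output directory is a temporary folder rather than the real tool-results location.

```python
import tempfile
from pathlib import Path

PER_RESULT_LIMIT = 50_000    # characters allowed in a single tool_result
PER_MESSAGE_LIMIT = 200_000  # characters allowed across one user message
PREVIEW_CHARS = 2_000        # stand-in for the "first 2 KB" preview


def persist(block, out_dir):
    """Write the full output to disk and return a block holding only the reference message."""
    text = block["content"]
    path = Path(out_dir) / f"{block['tool_use_id']}.txt"
    path.write_text(text, encoding="utf-8")
    reference = (
        "<persisted-output>\n"
        f"Output too large ({len(text)} chars). Full output saved to: {path}\n\n"
        "Preview (first 2 KB):\n"
        f"{text[:PREVIEW_CHARS]}\n"
        "</persisted-output>"
    )
    return {**block, "content": reference}


def apply_budget(blocks, out_dir):
    blocks = list(blocks)
    persisted = set()  # indexes already replaced by a reference message

    def replace(i):
        blocks[i] = persist(blocks[i], out_dir)
        persisted.add(i)

    # Per-result check: any single oversized result goes to disk.
    for i, block in enumerate(blocks):
        if len(block["content"]) > PER_RESULT_LIMIT:
            replace(i)

    # Aggregate check: persist the largest remaining result until within budget.
    while sum(len(b["content"]) for b in blocks) > PER_MESSAGE_LIMIT:
        remaining = [i for i in range(len(blocks)) if i not in persisted]
        if not remaining:
            break
        replace(max(remaining, key=lambda i: len(blocks[i]["content"])))
    return blocks


with tempfile.TemporaryDirectory() as tmp:
    # Five parallel results of 45K chars each: none exceeds 50K, but together they exceed 200K.
    parallel = [
        {"type": "tool_result", "tool_use_id": f"toolu_{n}", "content": "x" * 45_000}
        for n in range(5)
    ]
    budgeted = apply_budget(parallel, tmp)
    replaced_count = sum(b["content"].startswith("<persisted-output>") for b in budgeted)
    print(replaced_count)
```

In the parallel-tools case, replacing a single result is enough to bring the message back under the aggregate limit, so the other four stay verbatim.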

Before/After Comparison of the Messages Array

JSONC

// Before transformation — Original tool_result
{
  "type": "user",
  "message": {
    "content": [
      {
        "type": "tool_result",
        "tool_use_id": "toolu_abc",
        "content": "<300KB of bash stdout>"
      }
    ]
  }
}

// After transformation — Preview + disk path
{
  "type": "user",
  "message": {
    "content": [
      {
        "type": "tool_result",
        "tool_use_id": "toolu_abc",
        "content": "<persisted-output>Output too large (317842 chars). Full output saved to: /project/.claude/tool-results/toolu_abc.txt\n\nPreview (first 2 KB):\n...\n</persisted-output>"
      }
    ]
  }
}

Scenario

You have Claude run cat huge.log, stdout is 500KB. If this 500KB original text went directly into context:

This round's API call would eat up most of the window
Every subsequent round would resend it (no client cache can solve this)
The model can't actually read such a long log anyway

What Tool Result Budget does: only let a 2KB preview enter context, the model sees "the file is there," and if it really needs details later, it can precisely re-read the corresponding segment via Read(offset, limit). A lightweight disk layer replaces a heavy token layer.

Level 2: Snip (Precision Trimming)

What It Is

A mechanism that lets the model delete history on its own initiative: each user input gets a short ID, the model can reference that ID to say "I no longer need this whole turn," and the client then physically removes everything from that user input up to (but not including) the next user input from the messages array. That covers the user message itself, the subsequent assistant thinking, and all tool_use / tool_result blocks.

This is the only model-driven level in the entire pipeline—other levels are automatic decisions made by the client.

Deletion Unit: An Entire User Turn

The most important thing to understand about Snip is that it works by user turn as the unit, not by single message.

A user turn looks like this:

TEXT

[ user input ][ assistant/tool_use ][ tool_result ][ assistant/tool_use ][ tool_result ]...
 ↑ tagged with [id:xxxxxx]                                                                  ↑ until next real user input

When the model calls SnipTool on a certain ID, everything from that user input start to before the next user input disappears from subsequent API requests.
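The turn-level deletion can be sketched like this. It is a simplified illustration: it identifies a turn by the `[id:...]` tag at the end of the user text, whereas the mechanism described in this article works on message UUIDs. The function names `is_real_user_input` and `snip` are hypothetical.

```python
import re

ID_TAG = re.compile(r"\[id:([0-9a-z]+)\]\s*$")


def is_real_user_input(msg):
    """A user message whose content is plain text, not a list of tool_result blocks."""
    return msg["type"] == "user" and isinstance(msg["message"]["content"], str)


def snip(messages, ids):
    """Drop every user turn whose tagged ID is in `ids`."""
    ids = set(ids)
    kept, dropping = [], False
    for msg in messages:
        if is_real_user_input(msg):
            # A real user input starts a new turn; decide whether this turn is snipped.
            match = ID_TAG.search(msg["message"]["content"])
            dropping = bool(match) and match.group(1) in ids
        if not dropping:
            kept.append(msg)
    return kept


history = [
    {"type": "user", "message": {"content": "Find all TODOs\n[id:abc123]"}},
    {"type": "assistant", "message": {"content": [
        {"type": "tool_use", "id": "toolu_01", "name": "Grep", "input": {"pattern": "TODO"}}]}},
    {"type": "user", "message": {"content": [
        {"type": "tool_result", "tool_use_id": "toolu_01", "content": "23 matches"}]}},
    {"type": "user", "message": {"content": "Fix the login bug first\n[id:def456]"}},
    {"type": "assistant", "message": {"content": [{"type": "text", "text": "On it."}]}},
]

after = snip(history, ["abc123"])
print(len(after))  # the three messages of the first turn are gone
```

Because the tool_result message is not a real user input, it never ends the turn, so a tool_use and its tool_result are always removed together.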

When It Triggers

Three time points:

Before each API request: On the copy of messages about to be sent, append [id:<shortID>] tail tags to all non-meta user messages (tool_result class user messages don't count as "real user input," so they don't get IDs)
Context grows ~10K tokens without snipping: Inject a nudge attachment reminding the model "you can snip"
Model actively calls SnipTool: The actual deletion executes

What Happens After Triggering

Short IDs are derived from message UUIDs (the first 10 hex characters are converted to base36 and truncated to 6 characters), so each user input has a stable, short ID that the model can easily repeat
The model calls SnipTool with one or more IDs; the client collects the UUIDs of every message in the corresponding user turns and deletes them from the in-memory array
removedUuids is written to a boundary record in the transcript and replayed on resume, so the deletion persists
After deletion, the parentUuid of each affected message is repaired so that no chain is left dangling

Key detail: [id:xxxxxx] tags are only added to the "copy sent to API", not written back to original storage. The user's original words in the transcript remain clean; only the model-visible copy carries the tag.
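The derivation and the "tag only the outgoing copy" rule might look like the sketch below. The base36 alphabet (digits, then lowercase letters) and the left-padding of very small values to 6 characters are assumptions; the article only states "first 10 hex chars converted to base36, first 6 chars". `short_id` and `tag_for_api` are hypothetical names.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # assumed base36 digit set


def to_base36(n):
    if n == 0:
        return "0"
    digits = []
    while n:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def short_id(message_uuid):
    """First 10 hex chars of the UUID -> base36 -> first 6 chars."""
    hex_prefix = message_uuid.replace("-", "")[:10]
    encoded = to_base36(int(hex_prefix, 16))
    return encoded[:6].rjust(6, "0")  # padding is an assumption for small values


def tag_for_api(text, message_uuid):
    """Return the tagged text for the outgoing copy; the stored text is not modified."""
    return f"{text}\n[id:{short_id(message_uuid)}]"


stored = {"uuid": "3f2a9c1e-7b4d-4e2a-9c1e-0123456789ab", "text": "Fix the login bug first"}
outgoing = tag_for_api(stored["text"], stored["uuid"])
print(outgoing)
```

The ID depends only on the UUID, so the same message gets the same tag on every request, which is what lets the model refer back to it later.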

Complete Example: From Multi-turn Conversation to a Single Snip

Assume the session has accumulated the following history (omitting some fields for readability). Note that Claude Code appends [id:...] tags at the end of every real user input before sending to API—tool_result type user messages don't get tags.

Step 0: Original Messages Array (Version Sent to API)

JSONC

[
  // —— Turn 1: User requests TODO research ——
  {
    "type": "user",
    "message": {
      "content": "Help me find all the TODOs in the project, I want to handle them together\n[id:abc123]"
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Okay, I'll search with Grep." },
        {
          "type": "tool_use",
          "id": "toolu_01",
          "name": "Grep",
          "input": { "pattern": "TODO", "path": "src/" }
        }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_01",
          "content": "src/auth/login.ts:42: // TODO: handle refresh token\nsrc/api/user.ts:89: // TODO: validate input\n... 23 total ..."
        }
      ]
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Let me open login.ts to see that refresh token TODO's specific scenario." },
        {
          "type": "tool_use",
          "id": "toolu_02",
          "name": "Read",
          "input": { "file_path": "src/auth/login.ts" }
        }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_02",
          "content": "<full login.ts, ~4KB>"
        }
      ]
    }
  },

  // —— Turn 2: User changes mind, switches to bug fix ——
  {
    "type": "user",
    "message": {
      "content": "Forget about the TODOs for now, the bug with 500 errors on login in production is more urgent, fix this first\n[id:def456]"
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Understood, putting TODOs aside. For the login 500 error, let me check recent error logs." },
        {
          "type": "tool_use",
          "id": "toolu_03",
          "name": "Bash",
          "input": { "command": "tail -200 /var/log/app/error.log" }
        }
      ]
    }
  }
]

At this point, the model can see two real user input IDs in this turn: abc123 (TODO research) and def456 (login bug fix). Since the user explicitly said "forget about the TODOs," all content about Turn 1 (the list of 23 TODOs, full login.ts text, and corresponding thinking) is now pure token burden for the subsequent bug fix work.

Step 1: Model Actively Calls SnipTool

JSONC

{
  "type": "tool_use",
  "id": "toolu_04",
  "name": "Snip",
  "input": {
    "ids": ["abc123"],
    "reason": "User changed direction and abandoned TODO research, original turn's Grep results and login.ts full text are irrelevant to current bug fix"
  }
}

Step 2: Messages Array Sent to API in Next Turn After Snip Execution

The entire Turn 1 segment (user input + two assistant tool_use messages + two tool_result messages, 5 messages total) is physically deleted:

JSONC

[
  // Turn 1 entire segment disappears, starts directly from Turn 2's user input

  {
    "type": "user",
    "message": {
      "content": "Forget about the TODOs for now, the bug with 500 errors on login in production is more urgent, fix this first\n[id:def456]"
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Understood, putting TODOs aside. For the login 500 error, let me check recent error logs." },
        {
          "type": "tool_use",
          "id": "toolu_03",
          "name": "Bash",
          "input": { "command": "tail -200 /var/log/app/error.log" }
        }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_03",
          "content": "<log tail output>"
        }
      ]
    }
  }
]

Notable Points

Deletion is paired. toolu_01's tool_use and its corresponding tool_result disappear together, and the same goes for toolu_02. This prevents API-side errors such as "tool_result found but no tool_use" or vice versa.
[id:def456] is untouched. Snip precisely targets the user turn where abc123 lives, without harming subsequent turns.
The on-disk transcript remains unchanged. If you resume this session, Claude Code replays the same deletion based on removedUuids, keeping the model-visible view consistent, while the user's original words stay on disk for auditability.
Significant context savings. Turn 1's Grep results plus the full text of login.ts come to roughly 6~8K tokens, all reclaimed by one Snip, and everything that remains is kept verbatim rather than summarized.

Why This Design

The weakness of traditional compact is "one-size-fits-all summarization"—coarse granularity, easily losing useful original text along with the chaff. Snip is the opposite: the model itself judges which turn is obsolete, precisely excising the entire turn, while remaining recent messages stay original. They complement each other:

When you just pivot direction once and want to discard a detour, use Snip
When the entire context is overloaded and there's no obvious "this turn is obsolete" boundary, use Microcompact / Autocompact later on

Level 3: Microcompact (Lightweight Rewrite)

What It Is

Lightweight compression targeting only old tool results. It doesn't summarize conversation, doesn't call models, doesn't modify user messages—it only does one thing: replace old large tool_result.content with placeholders or cache edit instructions.

It only processes results from these tools: Read, Bash, Grep, Glob, WebSearch, WebFetch, Edit, Write. User text, model thinking, plan, attachments—it touches none of them.

When It Triggers

Two independent paths:

Path A: Time-based Microcompact

Disabled by default, when enabled:
More than 60 minutes since last assistant message +
On main thread +
Check once before each request

Path B: Cached Microcompact

Feature flag enabled +
Model supports cache editing +
On main thread +
Check once before each request

What Happens After Triggering

Time-based path—directly modifies local messages:

Find all "compressible tool" tool_results by tool id
Keep the most recent 5 and replace the content of all the others with the literal string [Old tool result content cleared]
Also reset cached microcompact module state (to avoid cache references to invalidated tool ids)
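The time-based steps above can be sketched as follows. This is an illustrative version under assumptions: the helper names are hypothetical, the function returns a modified copy instead of editing in place, and the reset of the cached-microcompact module state is left out.

```python
import copy

COMPACTABLE_TOOLS = {"Read", "Bash", "Grep", "Glob", "WebSearch", "WebFetch", "Edit", "Write"}
PLACEHOLDER = "[Old tool result content cleared]"


def iter_blocks(messages, msg_type, block_type):
    """Yield content blocks of one type from messages of one type, oldest first."""
    for msg in messages:
        content = msg["message"]["content"]
        if msg["type"] == msg_type and isinstance(content, list):
            for block in content:
                if block.get("type") == block_type:
                    yield block


def time_based_microcompact(messages, keep_recent=5):
    # Map each tool_use id to its tool name so results can be filtered by tool.
    tool_names = {b["id"]: b["name"] for b in iter_blocks(messages, "assistant", "tool_use")}
    compactable = [
        b["tool_use_id"]
        for b in iter_blocks(messages, "user", "tool_result")
        if tool_names.get(b["tool_use_id"]) in COMPACTABLE_TOOLS
    ]
    # Everything except the newest `keep_recent` results gets cleared.
    cutoff = max(len(compactable) - keep_recent, 0)
    to_clear = set(compactable[:cutoff])

    result = copy.deepcopy(messages)
    for block in iter_blocks(result, "user", "tool_result"):
        if block["tool_use_id"] in to_clear:
            block["content"] = PLACEHOLDER  # the tool_use / tool_result pairing stays intact
    return result
```

Only the `content` field of old tool_result blocks changes; the tool_use blocks and the `tool_use_id` values are kept, so the request remains valid for the API.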

Cached path—local messages unchanged, instead adds cache_edits at API layer:

Local array shows those old tool_results looking untouched
But when sending to Anthropic, the payload includes an extra cache_edits directive telling the server "you can delete segment xxx from your cache"
Benefit is prompt cache prefix is preserved as much as possible, avoiding the "one move and all cache misses" of time-based approach

Additionally there's a layer of API-native Context Management, not done by client but natively supported by Anthropic API:

JSONC

{ "type": "clear_thinking_20251015", "keep": "all" }
{
  "type": "clear_tool_uses_20250919",
  "trigger": { "type": "input_tokens", "value": 180000 },
  "clear_at_least": { "type": "input_tokens", "value": 140000 }
}

These blocks are added to API parameters, and server-side automatically cleans tool_use content when exceeding 180K input tokens.

Complete Example: Time-based Path

To keep the example readable, here's a scaled-down scenario—assuming keepRecent = 2 (default is 5). Scenario: You have Claude research a project for you, ran 3 tools in succession, then went to lunch, came back 70 minutes later to continue asking questions.

Step 0: Messages Array Before Leaving

JSONC

[
  {
    "type": "user",
    "message": { "content": "Help me understand this project's structure\n[id:abc123]" }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Let me check the directory first." },
        { "type": "tool_use", "id": "toolu_01", "name": "Glob", "input": { "pattern": "**/*.ts" } }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_01",
          "content": "src/index.ts\nsrc/auth/login.ts\nsrc/api/user.ts\n... 147 files total ..."
        }
      ]
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Reading the entry file." },
        { "type": "tool_use", "id": "toolu_02", "name": "Read", "input": { "file_path": "src/index.ts" } }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_02",
          "content": "import { createApp } from './app'\nimport { loadConfig } from './config'\n... <full 4KB source> ..."
        }
      ]
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Now scanning config-related definitions." },
        { "type": "tool_use", "id": "toolu_03", "name": "Grep", "input": { "pattern": "loadConfig", "path": "src/" } }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_03",
          "content": "src/config.ts:12: export function loadConfig() {\nsrc/config.ts:34:   return loadConfig()\n... 8 matches total ..."
        }
      ]
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Project structure is roughly single entry + config/auth/api modules. Which block do you want to dive into first?" }
      ]
    }
    // ← Last assistant timestamp recorded as 12:00
  }
]

Step 1: User Returns After 70 Minutes

JSONC

{
  "type": "user",
  "message": { "content": "Let's start with the auth module\n[id:def456]" }
  // ← Current time 13:10, 70 minutes since last assistant (12:00)
}

At this moment, Time-based Microcompact trigger conditions are met: main thread + has previous assistant + gap > 60 minutes.

Step 2: Microcompact Scans the Messages Array

Finds all "compressible tool" tool_results from old to new:

TEXT

toolu_01 (Glob)   toolu_02 (Read)   toolu_03 (Grep)
   ↑ oldest            ↑ second            ↑ newest

Keeps the most recent keepRecent = 2 (toolu_02 / toolu_03), replaces remaining content with placeholder.

Step 3: Transformed Messages Array

JSONC

[
  // user_1 + corresponding assistant.tool_use completely untouched
  {
    "type": "user",
    "message": { "content": "Help me understand this project's structure\n[id:abc123]" }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Let me check the directory first." },
        { "type": "tool_use", "id": "toolu_01", "name": "Glob", "input": { "pattern": "**/*.ts" } }
      ]
    }
  },

  // ★ toolu_01's tool_result.content replaced
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_01",
          "content": "[Old tool result content cleared]"
        }
      ]
    }
  },

  // toolu_02 / toolu_03 original text preserved (within keepRecent window)
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Reading the entry file." },
        { "type": "tool_use", "id": "toolu_02", "name": "Read", "input": { "file_path": "src/index.ts" } }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_02",
          "content": "import { createApp } from './app'\nimport { loadConfig } from './config'\n... <full 4KB source> ..."
        }
      ]
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Now scanning config-related definitions." },
        { "type": "tool_use", "id": "toolu_03", "name": "Grep", "input": { "pattern": "loadConfig", "path": "src/" } }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        {
          "type": "tool_result",
          "tool_use_id": "toolu_03",
          "content": "src/config.ts:12: export function loadConfig() {\nsrc/config.ts:34:   return loadConfig()\n... 8 matches total ..."
        }
      ]
    }
  },

  // Assistant wrap-up + new user input appended as-is
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Project structure is roughly single entry + config/auth/api modules. Which block do you want to dive into first?" }
      ]
    }
  },
  {
    "type": "user",
    "message": { "content": "Let's start with the auth module\n[id:def456]" }
  }
]

Notable Points

tool_use_id is not deleted. toolu_01's tool_use (with parameters pattern: "**/*.ts") is fully preserved; only the content of the corresponding tool_result is replaced with a literal string. The API-side tool_use ↔ tool_result pairing remains valid.
The model can still tell which tools were run. It can see that toolu_01 was a Glob("**/*.ts") call; only the specific return value is gone. If it needs the result later, it can call Glob again.
User text and assistant thinking are untouched. Microcompact only targets the content of tool_results.
Volume gains. In this example toolu_01's Glob output is roughly 3KB, while the placeholder is 33 bytes. Real sessions often have 10+ old tool_results of several KB to tens of KB each, so this can save roughly 5K~50K tokens.

Differences in the Cached Path

The key to the Cached path is that local messages stay completely unchanged; the replacement happens in the server-side cache. The comparison below uses the scenario above but takes the Cached path:

JSONC

// Local messages array: All original text preserved, no [Old ... cleared] anywhere

// API request body sent to Anthropic (conceptual):
{
  "model": "claude-xxx",
  "messages": [ /* The complete array above, original text unchanged */ ],
  "cache_edits": [
    // Client-maintained pending edits, telling server
    // "You can delete the segment corresponding to toolu_01 from your prompt cache"
    { "action": "clear", "tool_use_id": "toolu_01" }
  ]
}

The benefit is: prompt cache prefix won't be interrupted. The time-based approach of directly modifying local messages changes the cache key, causing all cache hits to drop to zero next request. The Cached path lets the server "delete internally," releasing token cost while preserving cache hit rate.

Essentially a Hot/Cold Cache Distinction

Stepping back, the division of labor between the two paths is actually a cache state machine:

Hot cache (cache still valuable) → Take Cached path, API layer cache_edits for fine-grained editing
Cold cache (60 minutes inactive, cache likely expired) → Take Time-based path, abandon cache hits for context space

In the source code the two paths short-circuit: Time-based is checked first, and once it hits, it returns immediately without going through Cached. This order follows naturally from the hot/cold reasoning: once the cache has been judged cold, editing it is pointless.
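The ordering of the two checks can be written down as a small decision function. This is a sketch under assumptions: the function name, its parameters, and the returned labels are hypothetical; only the conditions (main thread, 60-minute gap, feature flag, model support) and the time-based-first order come from the description above.

```python
from datetime import datetime, timedelta

GAP_THRESHOLD = timedelta(minutes=60)


def choose_microcompact_path(
    now,
    last_assistant_at,
    *,
    main_thread,
    time_based_enabled,
    cached_enabled,
    model_supports_cache_edits,
):
    if not main_thread:
        return "none"
    # Cold cache: checked first, and a hit short-circuits the cached path.
    if (
        time_based_enabled
        and last_assistant_at is not None
        and now - last_assistant_at > GAP_THRESHOLD
    ):
        return "time_based"
    # Hot cache: leave local messages alone and edit the server-side cache instead.
    if cached_enabled and model_supports_cache_edits:
        return "cached"
    return "none"


flags = dict(
    main_thread=True,
    time_based_enabled=True,
    cached_enabled=True,
    model_supports_cache_edits=True,
)
noon = datetime(2025, 1, 1, 12, 0)
after_lunch = choose_microcompact_path(datetime(2025, 1, 1, 13, 10), noon, **flags)
still_active = choose_microcompact_path(datetime(2025, 1, 1, 12, 10), noon, **flags)
print(after_lunch, still_active)
```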

Scenario Summary

Active use: Cached path silently trims server-side cache in background, local experience is 0 change, but requests are cheaper
Returning after long pause: Time-based path directly clears old tool_results locally, trading cache hits for context space
Approaching token limit: API-native clear_tool_uses_20250919 as safety net, server automatically cleans at 180K threshold

Level 4: Context Collapse

This level exists in the query pipeline, positioned after Microcompact and before Autocompact. Enabling it suppresses active Autocompact—in Claude Code's design, Collapse and Autocompact compete for the same headroom, so when Collapse is on shouldAutoCompact() directly returns false, letting Collapse take over.

From transcripts you can see Collapse drops two types of records:

marble-origami-commit: Append-only splice instruction, recording "how to fold a segment of history into a summary placeholder," containing collapseId / summaryUuid / summaryContent / firstArchivedUuid / lastArchivedUuid
marble-origami-snapshot: Last-wins staged state snapshot, containing staged spans / armed flag / lastSpawnTokens

These record structures imply Collapse is doing "segmented archiving + summary placeholder"—roughly working by scoring early history, selecting, and packaging into an archived unit with summary, letting that history be replaced by a placeholder in subsequent requests. Finer details like staged span selection algorithms, summary placeholder specific formats, trigger threshold chains are not expanded in this article.

Level 5: Autocompact (Heavy Fallback)

What It Is

The last line of defense when the previous four levels couldn't compress context enough. It doesn't do compression itself, but chooses one of two sub-paths:

Preferred: Session Memory Compact (reads a continuously maintained summary.md)
Fallback: Traditional LLM Compact (makes a one-off model call to produce a 9-section summary)

When It Triggers

When shouldAutoCompact() judges "context approaching token limit." Note that if Context Collapse is enabled, this step is skipped directly (letting Collapse handle it).

Post-trigger flow:

TEXT

autoCompactIfNeeded()
  ├── Try Session Memory Compact first
  │     ├── Success → Return result
  │     └── Failure (null) → Fallback
  └── Traditional LLM Compact
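The try-then-fall-back flow in the diagram can be sketched as below. The three callables are hypothetical stand-ins injected as parameters; the only behavior taken from the article is that the preferred path signals failure by returning null (None here) and the LLM path is used only in that case.

```python
def auto_compact_if_needed(messages, *, should_compact, session_memory_compact, llm_compact):
    """Prefer the session-memory path; fall back to an LLM summary when it yields None."""
    if not should_compact(messages):
        return None  # nothing to do, including the case where Context Collapse suppresses it
    result = session_memory_compact(messages)
    if result is not None:
        return result
    return llm_compact(messages)


# Toy stand-ins to show both branches.
calls = []


def memory_unavailable(messages):
    calls.append("session_memory")
    return None  # e.g. summary.md is missing or still the empty template


def llm_summary(messages):
    calls.append("llm")
    return {"path": "llm", "kept": messages[-1:]}


outcome = auto_compact_if_needed(
    ["m1", "m2", "m3"],
    should_compact=lambda msgs: len(msgs) >= 3,
    session_memory_compact=memory_unavailable,
    llm_compact=llm_summary,
)
print(outcome["path"], calls)
```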

Sub-path A: Session Memory Compact (Structured Summary)

What It Is

Core idea: Don't wait until context explodes to start summarizing, continuously maintain a structured summary file in the background, and read this file directly when explosion happens.

The benefits are clear:

Zero API cost at compression time: there is no need for a one-off model call, only a disk read
More stable summary quality: the file is updated periodically in the background, giving more systematic coverage than a summary generated on the spot
Cross-turn information accumulates over time: corrections that were made and knowledge about the project can all build up in one file

Session Memory File Storage Location

Filename summary.md, full path:

TEXT

{projectDir}/{sessionId}/session-memory/summary.md

Note this is per session, not project-level shared. The reason is straightforward—different sessions do different things, sharing one would cause cross-interference.

10-Section Template

summary.md is not a free-form diary; a background agent fills in the blanks of a fixed template. The file is initialized with an empty template, and subsequent extractions only update the body text. The complete template has 10 sections:

#	Section	What this section holds (guidance)
1	Session Title	An information-dense 5-10 word session title, no filler words
2	Current State	What is actively being worked on right now? Unfinished tasks, what to do next
3	Task specification	What is the user trying to build? Any design decisions or explanatory context
4	Files and Functions	Which files matter? What do they contain, why are they relevant
5	Workflow	Which bash commands are commonly used, in what order, how to interpret output
6	Errors & Corrections	What errors were encountered, how were they fixed, what did the user correct, which paths don't work
7	Codebase and System Documentation	Important system components, how they collaborate
8	Learnings	What approaches worked, what didn't, what to avoid (don't repeat content from other sections)
9	Key results	If user explicitly requested a specific result (answer, table, document), preserve it verbatim here
10	Worklog	Step by step what was attempted, minimal summary per step

Note: Do not confuse this with the 9-section summary format of Traditional LLM Compact discussed later. That format is generated on demand by the model on the Autocompact fallback path and has different sections (e.g., "All user messages", "Current Work", "Optional Next Step", which are more conversation-context-oriented). Session Memory's 10-section template is oriented more toward "project memory."

When It Triggers

There are two separate questions here: when the background process updates summary.md, and when Autocompact reads it.

Background update triggers (default thresholds):

First initialization: Current messages reach 10000 tokens
Incremental update condition: (token growth ≥ 5000 && tool calls ≥ 3) || (token growth ≥ 5000 && last round had no tool calls)
Only runs on querySource === 'repl_main_thread', subagent / teammate don't run
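The background update conditions can be condensed into one predicate. This is a sketch: the function and parameter names are hypothetical, and the thresholds are the default values listed above. Note that the incremental condition `(A && B) || (A && C)` simplifies to `A && (B || C)`.

```python
INIT_THRESHOLD = 10_000      # tokens before the first extraction
UPDATE_TOKEN_GROWTH = 5_000  # token growth required between extractions
UPDATE_TOOL_CALLS = 3        # tool calls required between extractions


def should_update_session_memory(
    *,
    query_source,
    initialized,
    current_tokens,
    tokens_at_last_update,
    tool_calls_since_update,
    last_turn_had_tool_calls,
):
    if query_source != "repl_main_thread":
        return False  # subagents and teammates never run the extraction
    if not initialized:
        return current_tokens >= INIT_THRESHOLD
    grew_enough = current_tokens - tokens_at_last_update >= UPDATE_TOKEN_GROWTH
    return grew_enough and (
        tool_calls_since_update >= UPDATE_TOOL_CALLS or not last_turn_had_tool_calls
    )


base = dict(
    query_source="repl_main_thread",
    initialized=True,
    current_tokens=30_000,
    tokens_at_last_update=24_000,
    tool_calls_since_update=1,
    last_turn_had_tool_calls=True,
)
print(should_update_session_memory(**base))  # grew by 6000, but only 1 tool call and the last turn used tools
```

The second branch (growth plus a last turn without tool calls) lets the extraction run at a conversational pause even when few tools were used.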

Autocompact call timing (Sub-path A first-try entry):

shouldAutoCompact() judges compression needed
Wait for any ongoing background extraction to finish
Read summary.md; if file doesn't exist or is still empty template, return null to yield to fallback
Otherwise execute compression

`lastSummarizedMessageId` Lifecycle

This is one of Session Memory Compact's core states, determining "which messages after this belong to the retention zone." Without understanding this, you can't understand the retention algorithm below.

Semantics: UUID of the last message absorbed by summary.md

That is, the message with this UUID and everything before it in the array have already been digested by Session Memory; the messages after it are the increment to be processed in the next extraction.

Update timing and value

After background extraction ends, not updated unconditionally, but has a safety gate:

PYTHON

# Pseudocode
def update_last_summarized_message_id_if_safe(messages):
    # Is the last assistant still waiting for tool_result?
    if has_tool_calls_in_last_assistant_turn(messages):
        return  # Don't update, avoid cutting an unclosed tool pair in the middle

    last = messages[-1]
    if last.uuid:
        set_last_summarized_message_id(last.uuid)

Why this gate?

Because lastSummarizedMessageId is used by the compact phase to calculate where the retention zone starts. If it were updated at the moment when the assistant has just issued a tool_use and the tool_result has not returned yet, a subsequent compact might classify the tool_use as "already summarized" and the tool_result as "retention zone," and the API request would fail with a 400 error along the lines of "tool_result can't find its corresponding tool_use." This gate ensures updates happen at natural breakpoints in the conversation.
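A runnable variant of the gate, using the message shape from the earlier JSONC examples, might look like this. The body of `has_tool_calls_in_last_assistant_turn` is an assumption about what that helper checks (whether the most recent assistant message contains any tool_use block), and `SessionMemoryState` is a hypothetical container for the ID.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionMemoryState:
    last_summarized_message_id: Optional[str] = None


def has_tool_calls_in_last_assistant_turn(messages):
    """True if the most recent assistant message contains a tool_use block (assumed semantics)."""
    for msg in reversed(messages):
        if msg["type"] == "assistant":
            content = msg["message"]["content"]
            return isinstance(content, list) and any(
                block.get("type") == "tool_use" for block in content
            )
    return False


def update_last_summarized_message_id_if_safe(state, messages):
    """Advance the marker only at a natural breakpoint; report whether it moved."""
    if not messages or has_tool_calls_in_last_assistant_turn(messages):
        return False
    uuid = messages[-1].get("uuid")
    if uuid:
        state.last_summarized_message_id = uuid
        return True
    return False


mid_tool_call = [
    {"uuid": "u1", "type": "user", "message": {"content": "Check the logs"}},
    {"uuid": "a1", "type": "assistant", "message": {"content": [
        {"type": "tool_use", "id": "toolu_01", "name": "Bash", "input": {"command": "tail error.log"}}]}},
]
finished = mid_tool_call + [
    {"uuid": "u2", "type": "user", "message": {"content": [
        {"type": "tool_result", "tool_use_id": "toolu_01", "content": "..."}]}},
    {"uuid": "a2", "type": "assistant", "message": {"content": [{"type": "text", "text": "Found it."}]}},
]

state = SessionMemoryState()
print(update_last_summarized_message_id_if_safe(state, mid_tool_call), state.last_summarized_message_id)
print(update_last_summarized_message_id_if_safe(state, finished), state.last_summarized_message_id)
```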

Retention Window Algorithm Details

What calculateMessagesToKeepIndex() does in source code, written as pseudocode:

PYTHON

# Default config
MIN_TOKENS = 10_000
MIN_TEXT_BLOCK_MESSAGES = 5
MAX_TOKENS = 40_000

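def calculate_messages_to_keep_index(messages, last_summarized_message_id, floor=0):
    # floor: index of the previous compact boundary (0 if none); expansion never crosses it
    # 1) Find boundary point
    last_index = find_index(messages, by_uuid=last_summarized_message_id)
    start = last_index + 1  # Default: keep everything after the last message digested by summary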

    # 2) Count current retention zone token / text-block counts
    total_tokens = sum_tokens(messages[start:])
    text_block_count = count_text_block_messages(messages[start:])

    # 3) If already satisfies minimum → done
    if total_tokens >= MIN_TOKENS and text_block_count >= MIN_TEXT_BLOCK_MESSAGES:
        return adjust_index_to_preserve_api_invariants(messages, start)

    # 4) Otherwise, expand start backwards from start-1 (taking earlier)
    #    Until simultaneously satisfying (total_tokens >= MIN_TOKENS AND text_block_count >= MIN)
    #    Or total_tokens >= MAX_TOKENS hard limit reached first
    #    Or hit floor (previous compact boundary, can't cross)
    for i in range(start - 1, floor - 1, -1):
        total_tokens += token_count(messages[i])
        if has_text_blocks(messages[i]):
            text_block_count += 1
        start = i

        if total_tokens >= MAX_TOKENS:
            break  # Hard limit priority

        if total_tokens >= MIN_TOKENS and text_block_count >= MIN_TEXT_BLOCK_MESSAGES:
            break  # Minimums simultaneously satisfied

    # 5) Finally align API constraints
    return adjust_index_to_preserve_api_invariants(messages, start)

Key points:

Minimum is AND, not OR. The window must hold both enough tokens and enough text-block messages. This avoids degenerate cases, such as 50K tokens of pure tool_result (a batch of large file Reads) that satisfy the token minimum while containing only 2 messages with actual text, which would leave the model with almost no conversational continuity. The text-block minimum ensures that at least 5 messages of actual dialogue remain.
Maximum is a hard stop. Once total_tokens reaches 40K, the loop breaks and stops expanding. This is the capacity ceiling of the retention zone, not a minimum guarantee.
Scan direction: start moves backwards. The end of the window (the tail of the messages array) never moves; only start does. Each step of i pulls one earlier message into the retention zone.
Never cross the compact boundary floor. If a compact already happened earlier, the backward expansion stops at the previous compact boundary.
Finally, align with API invariants. If start lands on a tool_result whose paired tool_use sits in an earlier assistant message, that assistant message is pulled into the window too. Thinking-block merge requirements are handled the same way.
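
The last point can be made concrete with a small sketch. This is an illustration under assumptions, not the actual implementation: the function name mirrors the pseudocode above, `block_ids` is a hypothetical helper, messages are assumed to be dicts in the `type` / `message.content` shape used in this article, and only tool_use / tool_result pairing is handled (the thinking-block case is left out).

```python
def block_ids(message, block_type, key):
    """Return the ids carried by blocks of one type; string content has no blocks."""
    content = message["message"]["content"]
    if isinstance(content, str):
        return set()
    return {block[key] for block in content if block.get("type") == block_type}


def adjust_index_to_preserve_api_invariants(messages, start, floor=0):
    """Move start earlier until every tool_result in messages[start:] has its
    tool_use inside the same slice. Never moves past floor (hypothetical sketch)."""
    while start > floor:
        kept = messages[start:]
        used = set().union(*(block_ids(m, "tool_use", "id") for m in kept))
        needed = set().union(*(block_ids(m, "tool_result", "tool_use_id") for m in kept))
        if needed <= used:
            break  # every tool_result in the window is paired
        start -= 1  # pull one earlier message into the retention zone
    return start


history = [
    {"type": "user", "message": {"content": "Refactor the auth middleware"}},
    {"type": "assistant", "message": {"content": [
        {"type": "tool_use", "id": "toolu_80", "name": "Read", "input": {}}]}},
    {"type": "user", "message": {"content": [
        {"type": "tool_result", "tool_use_id": "toolu_80", "content": "..."}]}},
    {"type": "assistant", "message": {"content": [{"type": "text", "text": "Done."}]}},
]

# Index 2 is an orphaned tool_result, so the window grows to include index 1.
print(adjust_index_to_preserve_api_invariants(history, 2))  # prints 1
```

The sketch rescans the whole window on every step, which is fine for illustration; a real implementation would likely track unmatched ids incrementally.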

Post-Compact Attachments Panorama

Many assume compact output is just three pieces: boundary + summary + recent messages. In fact a series of attachments follows, and they are what lets the model resume work quickly after a compact.

buildPostCompactMessages() assembly order is:

TEXT

[boundary marker]
  → [summary messages]
  → [messages to keep (retention zone verbatim)]
  → [attachments ← batch of meta messages, injected by 8 types below]
  → [hook results (session start hooks)]

These 8 attachment types have different trigger conditions:

#	Attachment Type	Injected Content	When it appears
1	`file_reference`	Recently read files, verbatim excerpt	Have recent Read files not in retention zone
2	`plan_file_reference`	Current session's plan file	Have active plan
3	`invoked_skills`	Skills activated this session	Activated any skill
4	`plan_mode`	Plan mode status hint	Currently in plan mode
5	`task_status`	Background running agent / task status	Have background async agent running
6	`deferred_tools_delta`	Tool list changes vs pre-compact	Tool list changed
7	`agent_listing_delta`	Agent list changes	Agent list changed
8	`mcp_instructions_delta`	MCP instruction changes	MCP instructions changed

Budgets (default constants):

File attachments: max 5, total ≤ 50K tokens, single file ≤ 5K tokens
Skill attachments: total ≤ 25K tokens, single skill ≤ 5K tokens

In other words, Claude Code does strict budget control on "what can be in the inventory"—summary tells you "what are we doing," attachments guarantee you "raw materials to continue." The division is very clear.
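
To make the file budget concrete, here is a minimal sketch of budgeted selection. Everything in it is an assumption made for illustration: the function and constant names are hypothetical, token counts are approximated as 4 characters per token, and truncating an oversized file (rather than skipping it) is a guess at the behavior, not something the source confirms.

```python
MAX_FILES = 5
MAX_TOKENS_PER_FILE = 5_000
TOTAL_FILE_TOKEN_BUDGET = 50_000
CHARS_PER_TOKEN = 4  # crude stand-in for a real tokenizer


def estimate_tokens(text):
    """Approximate token count by rounding up characters / 4."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def select_file_attachments(recent_files, retained_paths):
    """recent_files: list of (path, content), most recently read first.
    retained_paths: paths whose content is already visible in the retention zone."""
    selected = []
    used = 0
    for path, content in recent_files:
        if len(selected) >= MAX_FILES:
            break
        if path in retained_paths:
            continue  # no need to re-attach what the model can already see
        clipped = content[: MAX_TOKENS_PER_FILE * CHARS_PER_TOKEN]  # per-file cap
        cost = estimate_tokens(clipped)
        if used + cost > TOTAL_FILE_TOKEN_BUDGET:
            break  # total budget exhausted
        selected.append((path, clipped))
        used += cost
    return selected


files = [("src/auth/AuthSession.ts", "x" * 100), ("src/auth/login.ts", "y" * 40_000)]
picked = select_file_attachments(files, retained_paths=set())
print([(path, len(body)) for path, body in picked])
```

The second file is clipped to 20,000 characters, which is the 5K-token cap under the 4-characters-per-token approximation.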

Complete Example: Two-Hour Session Before/After Compression

Scenario: You and Claude discussed project auth refactoring for 2 hours, ran 40+ tool calls in between, modified a dozen files. Context grew near threshold, recent rounds are implementing AuthSession.refresh(). Background summary.md has been continuously updating.

Step 0: Current `summary.md` Content (On Disk)

After filling the 10-section template, it looks roughly like this (showing example fills for first few sections, real file has all 10 sections):

# Session Title
Refactor auth middleware for compliance rewrite

# Current State
Migrating session token storage from cookies to encrypted Redis.
Pending: integration tests for multi-device login path.
Immediate next: finish `AuthSession.refresh()` branch.

# Task specification
- Remove raw session tokens from client cookies (legal/compliance request)
- Introduce AuthSession wrapper that holds an opaque id, with real data in Redis
- Preserve existing /login and /logout API shape; only storage layer changes
- Ensure refresh flow works across multi-device, no forced logout on other devices

# Files and Functions
- src/auth/AuthSession.ts (new): wraps opaque id + Redis-backed metadata
- src/auth/login.ts (updated): issues AuthSession on credential success
- src/auth/middleware.ts (updated): validates AuthSession on every request
- src/redis/sessionStore.ts (new): typed Redis gateway for session records

# Workflow
- `pnpm test auth/` to run auth unit tests
- `pnpm run dev:redis` to boot a local Redis for integration runs
- Error "ECONNREFUSED 6379" means Redis isn't up; start it before tests

# Errors & Corrections
- Early attempt used HMAC-signed cookies — rejected by legal (still stores session data client-side)
- First Redis schema used JSON strings — switched to hashes for partial-field updates
- User corrected: refresh must NOT invalidate other devices' session ids

# Codebase and System Documentation
- Auth path: login.ts → middleware.ts → AuthSession.ts → Redis
- Session ids are opaque; all real data lives in Redis under `session:{id}`
- TTL is sliding: every successful request extends expiry by 30 days

# Learnings
- Opaque id format needs to be URL-safe (base58 chosen over base64)
- Fail-closed fallback is acceptable because re-login UX is already smooth

# Key results
- (none yet — refresh implementation in progress)

# Worklog
- Drafted AuthSession type and Redis gateway
- Updated login/logout to issue/revoke AuthSession
- Updated middleware to validate AuthSession on each request
- Started refresh() branch; paused to handle multi-device concern

Note that this file is a fixed template filled in by a background agent, not a free-form diary. In the template, each section heading is followed by an italic guidance line (such as "What is actively being worked on right now?"), and the background agent fills in each section according to that guidance.

Step 1: Messages Array Before Compression (Simplified Illustration)

JSONC

[
  // ========== Early 100+ messages (total ~80K tokens) ==========

  // UUID: u001 — First user input of session
  {
    "type": "user",
    "message": { "content": "We need to refactor auth middleware, legal gave new requirements\n[id:a1b2c3]" }
  },
  { "type": "assistant", "message": { "content": [/* ... long discussion ... */] } },
  { "type": "user", "message": { "content": [/* tool_result: read middleware.ts */] } },
  // ... omitting 100+ interleaved user/assistant/tool_result messages here ...

  // UUID: u128 — This is lastSummarizedMessageId (summary.md digested up to here)

  // ========== Recent several messages (total ~12K tokens, within 10K~40K retention window) ==========

  // UUID: u129
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "I sketched out the main branch of AuthSession.refresh, writing the core path first." },
        { "type": "tool_use", "id": "toolu_80", "name": "Read", "input": { "file_path": "src/auth/AuthSession.ts" } }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        { "type": "tool_result", "tool_use_id": "toolu_80", "content": "<Current AuthSession.ts content>" }
      ]
    }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "refresh needs to update Redis TTL and write back new opaque id. Writing a version." },
        { "type": "tool_use", "id": "toolu_81", "name": "Edit", "input": { "file_path": "src/auth/AuthSession.ts", "old_string": "...", "new_string": "..." } }
      ]
    }
  },
  {
    "type": "user",
    "message": {
      "content": [
        { "type": "tool_result", "tool_use_id": "toolu_81", "content": "File edited." }
      ]
    }
  },
  {
    "type": "user",
    "message": { "content": "Wait refresh needs to consider multi-device scenario\n[id:m9n8o7]" }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Okay, then refresh can't invalidate other devices' session ids. Let me adjust the design slightly..." }
      ]
    }
  }
]

Step 2: Autocompact Triggers, Chooses Session Memory Compact Path

Algorithm does three things:

Cut boundary: With lastSummarizedMessageId = u128, u129 and everything after it belong to the retention zone.
Adjust API invariants: The first item in the retention zone is an assistant message with a tool_use, and its paired tool_result is also in the retention zone, so nothing needs to be prepended.
Generate summary message: Wrap the body of summary.md into a user message.
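
The first and third steps can be sketched as follows. This is a simplified illustration, not the real code: both function names are hypothetical, each message is assumed to carry a `uuid` field, and returning None when the uuid is missing is an assumed way of signalling "fall back to the other path". The summary prefix is the one shown in the Step 3 example.

```python
def split_at_last_summarized(messages, last_summarized_id):
    """Split into (digested, retained) right after the message with the given uuid.
    Returns None if the uuid is not found (assumed fallback signal)."""
    for index, message in enumerate(messages):
        if message.get("uuid") == last_summarized_id:
            return messages[: index + 1], messages[index + 1 :]
    return None


def summary_to_user_message(summary_md):
    """Wrap the on-disk summary body into a meta user message."""
    return {
        "type": "user",
        "isMeta": True,
        "message": {"content": "Below is a summary of the session so far:\n\n" + summary_md},
    }


transcript = [{"uuid": "u127"}, {"uuid": "u128"}, {"uuid": "u129"}, {"uuid": "u130"}]
digested, retained = split_at_last_summarized(transcript, "u128")
print([m["uuid"] for m in retained])  # prints ['u129', 'u130']
```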

Step 3: Messages Array After Compression

JSONC

[
  // ---- ① Compact boundary (with preservedSegment metadata for relink on resume) ----
  {
    "type": "user",
    "isMeta": true,
    "message": { "content": "<compact boundary>" },
    "compactMetadata": {
      "preservedSegment": {
        "headUuid": "u001",
        "anchorUuid": "u128",
        "tailUuid": "u134"
      }
    }
  },

  // ---- ② Summary message (from summary.md, injected as user role) ----
  {
    "type": "user",
    "isMeta": true,
    "message": {
      "content": "Below is a summary of the session so far:\n\n# Session Title\nRefactor auth middleware for compliance rewrite\n\n# Current State\nMigrating session token storage from cookies to encrypted Redis.\n...\n# Pending tasks\n- AuthSession.refresh() implementation\n- Integration tests for multi-device case\n- Rollout plan (feature flag name: `auth_opaque_sessions_v2`)"
    }
  },

  // ---- ③ Recent messages: Preserved verbatim (u129 ~ u134, not a word changed) ----
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "I sketched out the main branch of AuthSession.refresh, writing the core path first." },
        { "type": "tool_use", "id": "toolu_80", "name": "Read", "input": { "file_path": "src/auth/AuthSession.ts" } }
      ]
    }
  },
  { "type": "user", "message": { "content": [/* toolu_80 tool_result verbatim */] } },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "refresh needs to update Redis TTL and write back new opaque id. Writing a version." },
        { "type": "tool_use", "id": "toolu_81", "name": "Edit", "input": {/* ... */} }
      ]
    }
  },
  { "type": "user", "message": { "content": [/* toolu_81 tool_result verbatim */] } },
  { "type": "user", "message": { "content": "Wait refresh needs to consider multi-device scenario\n[id:m9n8o7]" } },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Okay, then refresh can't invalidate other devices' session ids. Let me adjust the design slightly..." }
      ]
    }
  },

  // ---- ④ Post-compact attachments (injected sequentially by trigger conditions) ----

  // ④-1 file_reference: Recently read ≤5 files (50K token budget, 5K per file)
  {
    "type": "user",
    "isMeta": true,
    "message": {
      "content": [
        { "type": "text", "text": "<attachment: src/auth/AuthSession.ts>\n<Full content ≤5K tokens>" },
        { "type": "text", "text": "<attachment: src/auth/login.ts>\n<Full content>" },
        { "type": "text", "text": "<attachment: src/redis/sessionStore.ts>\n<Full content>" }
      ]
    }
  },

  // ④-2 plan_file_reference: Active plan file
  {
    "type": "user",
    "isMeta": true,
    "message": { "content": "<plan: implement AuthSession.refresh with multi-device support>" }
  },

  // ④-3 invoked_skills: Skills activated this session (≤5K per skill, total ≤25K)
  {
    "type": "user",
    "isMeta": true,
    "message": { "content": "<invoked skills: test-driven-development, systematic-debugging>" }
  },

  // ④-4 plan_mode: If currently in plan mode, attach a status hint
  // (Not in plan mode in this example, skip)

  // ④-5 task_status: If background async agent is running
  // (No background task in this example, skip)

  // ④-6 deferred_tools_delta / ④-7 agent_listing_delta / ④-8 mcp_instructions_delta
  // These three deltas only inject if tool list/agent list/MCP instructions changed vs pre-compact
  // (Assume no changes in this example, skip)

  // ---- ⑤ Session start hooks ----
  {
    "type": "user",
    "isMeta": true,
    "message": { "content": "<session start hook output>" }
  }
]

Notable Points

Block ③, "recent messages preserved verbatim," is the essential difference between SM-compact and traditional LLM compact. The model receives your actual words, the actual tool calls, and the actual results, not an LLM paraphrase.
Summary comes from a file, not an ad hoc model call. Because summary.md is continuously maintained in the background, the client doesn't need to make another API call at the moment Autocompact triggers; it just reads from disk. This is why it is tried first: it is faster and cheaper than the fallback LLM Compact.
API invariants won't break. Every tool_result in the retention zone can find its corresponding tool_use. The algorithm checks and, if necessary, prepends the assistant message containing the tool_use, even if its original position was before lastSummarizedMessageId.
preservedSegment is the hook for relinking. The headUuid / anchorUuid / tailUuid on the compact boundary record "where to where is the retention segment." Resume uses these three UUIDs to reconnect the compacted view with the original transcript.
Post-compact attachments are not part of the summary; they are the "inventory." The summary tells you "we're modifying AuthSession," while the attachments guarantee that the latest source of AuthSession.ts is right there in context, with no need to Read it again. The file budget is fixed: 50K tokens total, 5K per file, max 5 files.

Sub-path B: Traditional LLM Compact (9-Section Summary)

What It Is

The old path, taken when Session Memory Compact fails (most common reason: summary.md hasn't reached its initialization threshold before the context fills up). The approach is straightforward: make a one-off model call to generate a 9-section structured summary of the current session, then replace the original history with that summary.

When It Triggers

When trySessionMemoryCompaction() returns null.

What Happens After Triggering

The core is a "conversation within a conversation"—client constructs a new API request using current messages:

PYTHON

# Pseudocode
summary_prompt = build_compact_prompt()   # Full text see "Appendix" at end of this section
summary_request = { "role": "user", "content": summary_prompt }

api_messages = normalize(strip_images(strip_attachments([
  *get_messages_after_compact_boundary(messages),   # Current history (excluding previously compacted parts)
  summary_request                                    # Append "please summarize" user message at end
])))

summary_text = call_api(
  messages = api_messages,
  system   = "You are a helpful AI assistant tasked with summarizing conversations.",
  thinking = DISABLED,
  source   = "compact"
)

The actual api_messages array sent to the summarization model looks roughly like this—the big middle section is the complete original history of the current session, with brackets indicating omitted portions:

JSONC

[
  // ========== Conversation history (images stripped, reinjected attachments stripped) ==========

  {
    "type": "user",
    "message": { "content": "Help me refactor auth middleware, legal gave new requirements\n[id:a1b2c3]" }
  },
  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Okay, reading middleware.ts first." },
        { "type": "tool_use", "id": "toolu_01", "name": "Read", "input": { "file_path": "src/auth/middleware.ts" } }
      ]
    }
  },
  {
    "type": "user",
    "message": { "content": [{ "type": "tool_result", "tool_use_id": "toolu_01", "content": "<middleware.ts verbatim>" }] }
  },

  // [...omitting N real historical messages: continuing user input / assistant tool_use / tool_result / thinking etc.
  //    In real sessions this segment is usually 50~200 messages, total 80K~150K tokens, precisely because it's too large that compact triggers...]

  {
    "type": "assistant",
    "message": {
      "content": [
        { "type": "text", "text": "Multi-device branch of refresh half-written, now stuck on whether to broadcast session changes globally." }
      ]
    }
  },

  // ========== Summary request appended at end (compact "instruction") ==========

  {
    "type": "user",
    "message": {
      "content": "CRITICAL: Respond with TEXT ONLY. Do NOT call any tools.\n\n- Do NOT use Read, Bash, Grep, Glob, Edit, Write, or ANY other tool.\n- You already have all the context you need in the conversation above.\n- Tool calls will be REJECTED and will waste your only turn — you will fail the task.\n- Your entire response must be plain text: an <analysis> block followed by a <summary> block.\n\nYour task is to create a detailed summary of the conversation so far ...\n\n[Full text see \"Appendix: Complete Compact Prompt\" at end of this section]"
    }
  }
]

Three key settings in call parameters:

system fixed sentence: "You are a helpful AI assistant tasked with summarizing conversations."
thinkingConfig explicitly disabled—summary task doesn't need extended thinking
querySource = "compact"—marks this as a "summary call," won't trigger context management flows like compact / snip again (avoiding recursion)

Model should return <analysis>...</analysis><summary>...</summary> two sections of plain text. Client extracts <summary> part as 9-section summary body, then calls buildPostCompactMessages() to assemble new main thread messages (shares same assembly function with Session Memory Compact).
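
Extracting the summary body from the two-part reply can be sketched with a regular expression. This is a minimal sketch: the function name is hypothetical, and returning None when the tags are missing is an assumption, since the source does not say how the client handles a malformed reply.

```python
import re

# Non-greedy match so only the first <summary> block is captured;
# DOTALL lets "." span the multi-line summary body.
SUMMARY_RE = re.compile(r"<summary>(.*?)</summary>", re.DOTALL)


def extract_summary(model_output):
    """Return the text inside <summary>...</summary>, or None if absent."""
    match = SUMMARY_RE.search(model_output)
    if match is None:
        return None
    return match.group(1).strip()


reply = "<analysis>\nWalked the history.\n</analysis>\n\n<summary>\n1. Primary Request and Intent:\n   Refactor auth\n</summary>"
print(extract_summary(reply))
```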

Summary fixedly requires 9 sections (full prompt see appendix at end of this section):

TEXT

1. Primary Request and Intent
2. Key Technical Concepts
3. Files and Code Sections
4. Errors and fixes
5. Problem Solving
6. All user messages
7. Pending Tasks
8. Current Work
9. Optional Next Step

What If Compact Itself Gets prompt-too-long

This is an easily overlooked but critical self-rescue mechanism. That "conversation within a conversation" itself is an API call, and it can also return PTL error—especially when the session is already huge when triggering compact, sending "current history + long prompt" together easily exceeds token limits.

Compact won't let the session die outright: it allows up to 3 PTL retries. The flow, as pseudocode:

PYTHON

# Pseudocode
MAX_PTL_RETRIES = 3

for attempt in range(MAX_PTL_RETRIES):
    try:
        return call_compact(messages)
    except PromptTooLong as e:
        if e.token_gap:
            # Cut head precisely by API-returned tokenGap
            messages = drop_oldest_groups_until_gap_covered(messages, e.token_gap)
        else:
            # Fallback: Cut oldest 20% of rounds
            messages = drop_oldest_20_percent(messages)

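        # If the first item is now an assistant message, the API will reject it (first must be user),
        # so insert a synthetic meta marker
        if first_is_assistant(messages):
            messages.insert(0, {
                "type": "user",
                "isMeta": True,
                "message": { "content": "[earlier conversation truncated for compaction retry]" }
            })

# Still prompt-too-long after MAX_PTL_RETRIES attempts: give up
raise RuntimeError("compact aborted: prompt too long after retries")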

Notable design choices:

Drop by round, not by single message. A round roughly corresponds to "user input → the assistant's series of tool_use/thinking → final text reply." Dropping whole rounds ensures that a tool_use and its tool_result are never separated, so no new API invariant violations are created.
tokenGap-based dropping takes priority. When the API explicitly tells you "you exceeded by X tokens," dropping rounds that total X tokens is enough; only when tokenGap is unavailable does it fall back to dropping 20%.
The synthetic meta marker is a structural tax. [earlier conversation truncated for compaction retry] is not meant for users to see, nor for the model to "really" read; it exists purely to satisfy the API constraint that the first message must be from the user.
Give up after 3 retries. If the request is still PTL after three cuts, the session likely contains a single oversized message (such as a 100K-token tool_result that Tool Result Budget didn't catch); compact is powerless there and aborts.
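
Round-based dropping can be sketched like this. It is a simplified illustration under assumptions: a round is taken to start at every user message whose content is a plain string (tool_result carriers have list content), the `count_tokens` parameter is an addition not present in the pseudocode above, and keeping at least one round is a choice made for the sketch.

```python
def group_into_rounds(messages):
    """Start a new round at each real user input; tool_result messages stay in the current round."""
    rounds = []
    for message in messages:
        is_user_input = (
            message["type"] == "user" and isinstance(message["message"]["content"], str)
        )
        if is_user_input or not rounds:
            rounds.append([])
        rounds[-1].append(message)
    return rounds


def drop_oldest_groups_until_gap_covered(messages, token_gap, count_tokens):
    """Drop whole rounds from the head until at least token_gap tokens are freed,
    always keeping the most recent round."""
    rounds = group_into_rounds(messages)
    freed = 0
    while len(rounds) > 1 and freed < token_gap:
        oldest = rounds.pop(0)
        freed += sum(count_tokens(m) for m in oldest)
    return [message for group in rounds for message in group]
```

Because the cut always falls on a round boundary, the first surviving message is a real user input, so in this sketch the synthetic marker is only needed when the history does not begin with one.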

Main Thread Messages Structure After Compression

After summary model returns, client uses buildPostCompactMessages() to rebuild main thread messages:

JSONC

[
  "<compact boundary>",
  "<compact summary user message (9-section body)>",
  "<post-compact attachments: Recent 5 files, plan, invoked skills, etc. 8 types>",
  "<session start hooks>"
]

The only essential difference from Session Memory Compact: there is no messagesToKeep segment. Traditional LLM Compact replaces all history with the summary, and recent messages are not preserved verbatim. Everything else (the order of boundary / summary / attachments / hooks, and the budgets) is identical, because both paths use the same assembly function.
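
The shared assembly order can be sketched in a few lines. The function below is a hypothetical Python rendering of buildPostCompactMessages(); its parameter names and the string placeholders are assumptions made for illustration. The only thing it is meant to show is that the two sub-paths differ in a single optional segment.

```python
def build_post_compact_messages(boundary, summary_messages, attachments,
                                hook_results, messages_to_keep=()):
    """Assemble: boundary → summary → (retained messages) → attachments → hooks."""
    return [boundary, *summary_messages, *messages_to_keep, *attachments, *hook_results]


# Session Memory Compact: the retention zone is passed through verbatim.
sm_result = build_post_compact_messages(
    "<boundary>", ["<summary.md>"], ["<files>"], ["<hooks>"],
    messages_to_keep=["u129", "u130"],
)

# Traditional LLM Compact: same call, no retained messages.
llm_result = build_post_compact_messages(
    "<boundary>", ["<9-section summary>"], ["<files>"], ["<hooks>"],
)
print(sm_result)
print(llm_result)
```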

Post-compact recovery budget is hardcoded constants (default):

Max 5 files recovered
All attachments total 50K tokens
Single file 5K tokens
Single skill 5K tokens
Skills total 25K tokens

Scenario

A new session in which you work through a very dense problem with Claude: a dozen large tool calls run within minutes, and the context hits the limit almost immediately. At this point summary.md hasn't reached its initialization threshold (10K tokens is the "stable enough" threshold, but your session became dense in a short time).

Session Memory Compact returns null, so Autocompact falls back to Traditional LLM Compact:

The main thread pauses first
A one-off model call produces the 9-section structured summary
Main thread messages are rebuilt from the summary + several recently read key files + the current plan
The session continues, but recent messages are not preserved verbatim; this is the biggest difference from Session Memory Compact

Appendix: Complete Compact Prompt

For easy verification, below is the full user message that getCompactPrompt() ultimately assembles and sends to the summarization model. It is concatenated from 4 segments:

TEXT

NO_TOOLS_PREAMBLE
+ BASE_COMPACT_PROMPT          ← With DETAILED_ANALYSIS_INSTRUCTION_BASE embedded
+ [Optional] customInstructions    ← If additional instructions configured via CLAUDE.md / compact instructions
+ NO_TOOLS_TRAILER

Full text (without customInstructions configuration):

TEXT

CRITICAL: Respond with TEXT ONLY. Do NOT call any tools.

- Do NOT use Read, Bash, Grep, Glob, Edit, Write, or ANY other tool.
- You already have all the context you need in the conversation above.
- Tool calls will be REJECTED and will waste your only turn — you will fail the task.
- Your entire response must be plain text: an <analysis> block followed by a <summary> block.

Your task is to create a detailed summary of the conversation so far, paying close attention to the user's explicit requests and your previous actions.
This summary should be thorough in capturing technical details, code patterns, and architectural decisions that would be essential for continuing development work without losing context.

Before providing your final summary, wrap your analysis in <analysis> tags to organize your thoughts and ensure you've covered all necessary points. In your analysis process:

1. Chronologically analyze each message and section of the conversation. For each section thoroughly identify:
   - The user's explicit requests and intents
   - Your approach to addressing the user's requests
   - Key decisions, technical concepts and code patterns
   - Specific details like:
     - file names
     - full code snippets
     - function signatures
     - file edits
   - Errors that you ran into and how you fixed them
   - Pay special attention to specific user feedback that you received, especially if the user told you to do something differently.
2. Double-check for technical accuracy and completeness, addressing each required element thoroughly.

Your summary should include the following sections:

1. Primary Request and Intent: Capture all of the user's explicit requests and intents in detail
2. Key Technical Concepts: List all important technical concepts, technologies, and frameworks discussed.
3. Files and Code Sections: Enumerate specific files and code sections examined, modified, or created. Pay special attention to the most recent messages and include full code snippets where applicable and include a summary of why this file read or edit is important.
4. Errors and fixes: List all errors that you ran into, and how you fixed them. Pay special attention to specific user feedback that you received, especially if the user told you to do something differently.
5. Problem Solving: Document problems solved and any ongoing troubleshooting efforts.
6. All user messages: List ALL user messages that are not tool results. These are critical for understanding the users' feedback and changing intent.
7. Pending Tasks: Outline any pending tasks that you have explicitly been asked to work on.
8. Current Work: Describe in detail precisely what was being worked on immediately before this summary request, paying special attention to the most recent messages from both user and assistant. Include file names and code snippets where applicable.
9. Optional Next Step: List the next step that you will take that is related to the most recent work you were doing. IMPORTANT: ensure that this step is DIRECTLY in line with the user's most recent explicit requests, and the task you were working on immediately before this summary request. If your last task was concluded, then only list next steps if they are explicitly in line with the users request. Do not start on tangential requests or really old requests that were already completed without confirming with the user first.
                       If there is a next step, include direct quotes from the most recent conversation showing exactly what task you were working on and where you left off. This should be verbatim to ensure there's no drift in task interpretation.

Here's an example of how your output should be structured:

<example>
<analysis>
[Your thought process, ensuring all points are covered thoroughly and accurately]
</analysis>

<summary>
1. Primary Request and Intent:
   [Detailed description]

2. Key Technical Concepts:
   - [Concept 1]
   - [Concept 2]
   - [...]

3. Files and Code Sections:
   - [File Name 1]
      - [Summary of why this file is important]
      - [Summary of the changes made to this file, if any]
      - [Important Code Snippet]
   - [File Name 2]
      - [Important Code Snippet]
   - [...]

4. Errors and fixes:
    - [Detailed description of error 1]:
      - [How you fixed the error]
      - [User feedback on the error if any]
    - [...]

5. Problem Solving:
   [Description of solved problems and ongoing troubleshooting]

6. All user messages:
    - [Detailed non tool use user message]
    - [...]

7. Pending Tasks:
   - [Task 1]
   - [Task 2]
   - [...]

8. Current Work:
   [Precise description of current work]

9. Optional Next Step:
   [Optional Next step to take]

</summary>
</example>

Please provide your summary based on the conversation so far, following this structure and ensuring precision and thoroughness in your response.

There may be additional summarization instructions provided in the included context. If so, remember to follow these instructions when creating the above summary. Examples of instructions include:
<example>
## Compact Instructions
When summarizing the conversation focus on typescript code changes and also remember the mistakes you made and how you fixed them.
</example>

<example>
# Summary instructions
When you are using compact - please focus on test output and code changes. Include file reads verbatim.
</example>

REMINDER: Do NOT call any tools. Respond with plain text only — an <analysis> block followed by a <summary> block. Tool calls will be rejected and you will fail the task.

Several prompt engineering details worth noting in this prompt:

Three hard prohibitions on tool calls: the opening CRITICAL, the mid-section "Tool calls will be REJECTED", and the closing REMINDER. For models with strong tool-calling habits, this repeated hard constraint is necessary; say it only once, and the model will still be tempted to Read to verify.
Forced two-part output <analysis> → <summary>: the former is the model "thinking first," the latter is what actually gets written to the transcript. They are separated so the thinking process doesn't pollute the summary body.
Section 6 "All user messages" is an anti-distortion defense: it forces a listing of all non-tool_result user messages, preventing the model from keeping only what it wants to remember.
Section 9 requires "direct quotes": the next step must include verbatim excerpts, preventing "task drift" after compact. What compact fears most is the model subtly changing user intent during summarization.
customInstructions as a tail slot: users can append to this prompt via CLAUDE.md or dedicated compact instructions, for example "focus on typescript code changes" / "include test output verbatim".
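
The four-segment assembly described at the start of this appendix can be sketched as a simple join. This is an illustration only: the function name, the blank-line separator, and the handling of empty custom instructions are assumptions, and the real getCompactPrompt() may format the optional segment differently.

```python
def assemble_compact_prompt(preamble, base_prompt, trailer, custom_instructions=None):
    """Join preamble + base prompt + optional custom instructions + trailer."""
    segments = [preamble, base_prompt]
    if custom_instructions and custom_instructions.strip():
        segments.append(custom_instructions.strip())  # optional slot before the trailer
    segments.append(trailer)
    return "\n\n".join(segments)


prompt = assemble_compact_prompt(
    "CRITICAL: Respond with TEXT ONLY. Do NOT call any tools.",
    "Your task is to create a detailed summary of the conversation so far ...",
    "REMINDER: Do NOT call any tools.",
    custom_instructions="Focus on typescript code changes.",
)
print(prompt)
```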

Business Insights from the Layered Design

If we put the five levels + two sub-paths in one table:

Level	Trigger Timing	Target Object	Cost	Calls Model?
Tool Result Budget	Tool returns & before sending request	Single tool_result	Extremely low	No
Snip	Per request / Model-initiated	Entire message	Low	No (Model-driven)
Microcompact (time)	After 60min silence	Old tool_result.content	Low	No
Microcompact (cached)	Per request (cache supported)	Server-side cache view	Extremely low	No
Context Collapse	Per request	Segmented archive + summary placeholder	Medium	Yes (Summary generation)
Session Memory Compact	Autocompact preferred	Early history → summary.md	Medium (Disk read)	Background agent maintains file
Traditional LLM Compact	Autocompact fallback	Full history → 9-section summary	High (Main thread LLM call)	Yes

Notable design choices:

"Do cheap things first." Tool Result Budget is just character counting + file writing, almost zero cost; LLM Compact is a main-thread API call, which is heavy. The pipeline puts cheap, fine-grained processing at the front and expensive, coarse processing at the back: a typical cost-aware pipeline.
"Don't call the model if possible." Apart from Context Collapse's summary generation (see the table above), no API call is spent on summarization until the final fallback path of Autocompact. The other earlier levels are mechanical replacement, UUID deletion, or server-side cache instructions.
"Preserve recent verbatim" is a clear value ordering. All the complexity of Session Memory Compact (continuous background maintenance of summary.md, lastSummarizedMessageId bookkeeping, API invariant repair) protects the same goal: recent messages preserved verbatim. The developer's "currently doing" often needs verbatim details, while background context only needs a knowledge-level summary.
Every level is a modification of the messages array. For the Anthropic API there is no mysterious compression parameter; all mechanisms land on that array in the payload. The only exceptions are Cached Microcompact's cache_edits and API-native Context Management (clear_thinking_* / clear_tool_uses_*), which are server-level conventions.
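
The "cheap first, expensive last" idea can be sketched as an ordered list of stages, each with its own trigger. Everything here is a toy stand-in: the stage names, thresholds, and the flat `role` / `content` message shape are assumptions chosen to keep the example short, not the real levels.

```python
def run_context_pipeline(messages, stages):
    """Run each (name, should_run, transform) stage in order; report which ones fired."""
    fired = []
    for name, should_run, transform in stages:
        if should_run(messages):
            messages = transform(messages)
            fired.append(name)
    return messages, fired


def clip_long_strings(messages, limit=20):
    """Toy cheap stage: shorten any content longer than limit characters."""
    return [
        {**m, "content": m["content"][:limit] + "[clipped]"} if len(m["content"]) > limit else m
        for m in messages
    ]


def keep_last_two(messages):
    """Toy expensive stage: stand-in for a summarizing compact."""
    return [{"role": "user", "content": "<summary>"}] + messages[-2:]


stages = [
    ("cheap_clip", lambda ms: any(len(m["content"]) > 20 for m in ms), clip_long_strings),
    ("heavy_compact", lambda ms: sum(len(m["content"]) for m in ms) > 200, keep_last_two),
]

conversation = [
    {"role": "user", "content": "x" * 500},
    {"role": "assistant", "content": "ok"},
]
result, fired = run_context_pipeline(conversation, stages)
print(fired)  # prints ['cheap_clip']
```

Because the cheap stage already brought the total under the heavy stage's threshold, the expensive stage never runs, which is the payoff of ordering by cost.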

Closing Remarks

If you're building context management for an AI application, this pipeline offers at least three ideas you can borrow directly:

Layer rather than rely on a single point. Don't have just one "compress when the threshold is hit" hammer. Different scales and types of bloat suit treatments of different cost.
Prefer preserving verbatim over summarizing. A model reading verbatim messages almost always outperforms one reading its own summary; preserve when possible.
The compression mechanism itself needs self-rescue. What if the API request your compact logic makes itself hits prompt-too-long? Claude Code specifically implements 3 PTL retries + a synthetic marker, which is worth borrowing.
