Token economics.
From prompt to production.
10 lessons — a prompt is instructions made of tokens. Context management is the economy of those tokens. Learn to optimize every one.
What Is a Prompt?
Instructions made of tokens
A prompt is an instruction — a series of tokens you send to an LLM. Every character, word, and punctuation mark is tokenized. Understanding this is the foundation: prompts are not magic strings, they are measured, budgeted, and optimized sequences of tokens.
```python
# Every prompt is a series of tokens
import tiktoken

enc = tiktoken.encoding_for_model("gpt-5.2")

system = "You are a helpful assistant."
user = "Fix the auth bug."

system_tokens = len(enc.encode(system))
user_tokens = len(enc.encode(user))

print(f"System: {system_tokens} tokens")
print(f"User: {user_tokens} tokens")
print(f"Total: {system_tokens + user_tokens} tokens")
```
Context Window
The finite token budget
Every LLM has a finite context window — the maximum number of tokens it can process in a single request. Think of it as a desk: your prompt (instructions), conversation history, retrieved context, and response space must all fit. When the desk is full, something must go.
```python
# Context window = your token budget
import tiktoken

enc = tiktoken.encoding_for_model("gpt-5.2")
WINDOW = 256_000  # max tokens

system = "You are a helpful assistant."
system_tokens = len(enc.encode(system))

remaining = WINDOW - system_tokens
print(f"Budget remaining: {remaining:,} tokens")
```
Trimming (Last-N)
Delete the oldest, keep the recent
The simplest token optimization strategy. When the context window fills up, delete the oldest conversation turns and keep only the last N. Like tearing pages from the front of a notebook — fast and predictable, but you lose all early context.
```python
# Trimming — last-N strategy
def trim_history(messages, n=3):
    """Keep the system prompt plus the last N turns."""
    system = [m for m in messages if m['role'] == 'system']
    turns = [m for m in messages if m['role'] != 'system']
    return system + turns[-n * 2:]  # each turn = user + assistant message

# 20 messages → 6 (last 3 turns)
messages = trim_history(conversation, n=3)
print(f"Kept {len(messages)} messages")
```
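A quick check of the last-N strategy on a synthetic conversation (the helper is repeated here so the snippet runs standalone; the message shapes are illustrative):

```python
def trim_history(messages, n=3):
    """Keep the system prompt plus the last N user/assistant turns."""
    system = [m for m in messages if m['role'] == 'system']
    turns = [m for m in messages if m['role'] != 'system']
    return system + turns[-n * 2:]

# 1 system message + 10 user/assistant turns = 21 messages
conversation = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(10):
    conversation.append({"role": "user", "content": f"question {i}"})
    conversation.append({"role": "assistant", "content": f"answer {i}"})

kept = trim_history(conversation, n=3)
print(len(kept))  # system + 3 turns (6 messages) = 7
```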
Summarisation
Condense to save tokens
Instead of deleting old turns, summarise the entire conversation into a compact snapshot. You preserve the big picture but lose verbatim detail. The trade-off: an extra API call (more tokens spent) vs. richer context retention. This is token economy in action.
| | Trimming (Last-N) | Summarisation |
|---|---|---|
| Speed | Instant | Slow (LLM call) |
| Token cost | Free | Extra API call |
| Early context | Lost completely | Preserved (condensed) |
| Best for | Simple chatbots | Complex workflows |
| Risk | Amnesia | Detail loss |
```python
# Summarisation — token economy
import anthropic

def summarise_history(messages, client):
    """Condense conversation to save tokens."""
    history_text = "\n".join(
        f"{m['role']}: {m['content']}" for m in messages
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Summarise this conversation:\n{history_text}"
        }]
    )
    return response.content[0].text
```
Context Management
Token optimization and economy
Context management is the economy of tokens — deciding what goes into the window and what stays out. You allocate a token budget across system prompt, user message, retrieved context, and response space. Every token has a cost, and every token must earn its place.
```python
# Token budget manager
class TokenBudget:
    def __init__(self, limit=128_000):
        self.limit = limit
        self.allocations = {}

    def allocate(self, name, tokens):
        self.allocations[name] = tokens

    @property
    def remaining(self):
        used = sum(self.allocations.values())
        return self.limit - used

    @property
    def utilization(self):
        return sum(self.allocations.values()) / self.limit
```
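A sketch of allocating the window across its usual consumers (the class is repeated so the snippet runs standalone; the numbers are illustrative, not recommendations):

```python
class TokenBudget:
    def __init__(self, limit=128_000):
        self.limit = limit
        self.allocations = {}

    def allocate(self, name, tokens):
        self.allocations[name] = tokens

    @property
    def remaining(self):
        return self.limit - sum(self.allocations.values())

    @property
    def utilization(self):
        return sum(self.allocations.values()) / self.limit

budget = TokenBudget(limit=128_000)
budget.allocate("system_prompt", 2_000)
budget.allocate("retrieved_context", 40_000)
budget.allocate("history", 30_000)
budget.allocate("response_space", 8_000)

print(f"{budget.remaining:,} tokens free")   # 48,000 tokens free
print(f"{budget.utilization:.1%} utilized")  # 62.5% utilized
```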
Context Engineering
IDE-driven, JIT context delivery
Modern IDEs don’t dump everything into the context window. They use just-in-time (JIT) context delivery — pulling in only the files, functions, and docs relevant to the current task. For long-horizon tasks spanning hundreds of tool calls, this intelligent context selection is essential.
| | Dump Everything | JIT Context |
|---|---|---|
| Strategy | Send all files | Pull relevant files on demand |
| Token usage | High (wasteful) | Low (efficient) |
| Quality | Diluted by noise | Focused signal |
| Best for | Small projects | Large codebases, long tasks |
| Example | Paste entire repo | IDE auto-includes imports |
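The "IDE auto-includes imports" idea can be sketched as a toy selector that pulls in only the active file plus the modules it imports (the repo layout and file contents below are hypothetical):

```python
import re

# Toy repo: path -> source (hypothetical contents)
REPO = {
    "auth.py": "import tokens\nimport db\n\ndef login(): ...",
    "tokens.py": "def issue(): ...",
    "db.py": "def query(): ...",
    "billing.py": "def charge(): ...",  # irrelevant to the current task
}

def jit_context(active_file):
    """Select only the active file plus the modules it imports."""
    selected = {active_file: REPO[active_file]}
    for mod in re.findall(r"^import (\w+)", REPO[active_file], re.M):
        path = f"{mod}.py"
        if path in REPO:
            selected[path] = REPO[path]
    return selected

context = jit_context("auth.py")
print(sorted(context))  # ['auth.py', 'db.py', 'tokens.py'] — billing.py stays out
```

Real IDEs resolve imports with a proper parser rather than a regex, but the economy is the same: focused signal in, noise out.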
Context Pollution
When tokens work against you
Not all tokens are equal. Irrelevant search results, stale tool outputs, and verbose error logs pollute the context — pushing out useful information and confusing the model. In a 200K token window processing 5 tickets, data from Ticket #1 clutters processing of Ticket #5.
```python
# Context pollution detection
def detect_pollution(messages):
    """Flag stale or redundant content."""
    stale = []
    for i, msg in enumerate(messages):
        if msg.get('tool_result'):
            age = len(messages) - i
            if age > 10:  # older than 10 turns
                stale.append(i)
    print(f"Found {len(stale)} stale entries")
    return stale
```
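Detection is only half the job; a follow-up pass can drop what was flagged (a minimal sketch using the same message shape, with the detector repeated so it runs standalone):

```python
def detect_pollution(messages):
    """Flag tool results older than 10 turns."""
    stale = []
    for i, msg in enumerate(messages):
        if msg.get('tool_result') and len(messages) - i > 10:
            stale.append(i)
    return stale

def prune(messages):
    """Drop flagged entries, keeping everything else in order."""
    stale = set(detect_pollution(messages))
    return [m for i, m in enumerate(messages) if i not in stale]

# 15 messages; the tool result at index 0 is 15 turns old, so it is pruned
msgs = [{'tool_result': True}] + [{'content': f'turn {i}'} for i in range(14)]
print(len(prune(msgs)))  # 14
```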
Automatic Context Compaction
API-level token optimization
Anthropic’s compaction_control parameter automatically summarizes conversation history when token usage exceeds a threshold. In real-world tests processing 5 customer service tickets: 208K tokens → 86K tokens — a 58.6% reduction, transparently, with no code changes.
```python
# Anthropic automatic context compaction
import anthropic

client = anthropic.Anthropic()

runner = client.beta.messages.tool_runner(
    model="claude-sonnet-4-5",
    max_tokens=4096,
    tools=tools,
    messages=messages,
    compaction_control={
        "enabled": True,
        "context_token_threshold": 5000,
    },
)

total_input = total_output = 0
for message in runner:
    total_input += message.usage.input_tokens
    total_output += message.usage.output_tokens
```
| Threshold | When to use | Compaction frequency |
|---|---|---|
| 5K–20K | Sequential entity processing | Frequent, minimal accumulation |
| 50K–100K | Multi-phase workflows | Balanced retention |
| 100K–150K | Tasks needing full history | Rare, preserves detail |
| Default 100K | General long-running tasks | Standard balance |
Context Editing & Memory Tool
Auto-cleaner + filing cabinet
Context Editing uses a secondary model to remove stale information (the auto-cleaner tidying the desk — up to 84% token reduction). Memory Tool provides persistent external storage (a filing cabinet) that survives across sessions. Together they improve complex task performance by 39%.
| | Context Editing | Memory Tool |
|---|---|---|
| What it does | Removes stale clutter | Saves key facts permanently |
| Where | On the desk (context) | In the cabinet (external) |
| Token saving | Up to 84% | Offloads to storage |
| Persistence | In-session only | Across all sessions |
| Combined | 39% better on complex tasks | 39% better on complex tasks |
```python
# Context editing example
def edit_context(messages, client):
    """Remove stale info from context."""
    prompt = (
        "Review this conversation. Remove"
        " outdated or redundant information."
        " Keep decisions and current state."
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4000,
        messages=[{
            "role": "user",
            "content": f"{prompt}\n\n{format_msgs(messages)}"
        }]
    )
    return parse_edited_messages(response.content[0].text)
```
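The Memory Tool side, the filing cabinet, can be sketched as a simple file-backed store that persists across sessions. This is a minimal stand-in to illustrate the idea, not the actual tool's API:

```python
import json
import tempfile
from pathlib import Path

class MemoryStore:
    """A file-backed 'filing cabinet' that survives across sessions."""

    def __init__(self, path):
        self.path = Path(path)
        self.facts = json.loads(self.path.read_text()) if self.path.exists() else {}

    def save(self, key, value):
        self.facts[key] = value
        self.path.write_text(json.dumps(self.facts, indent=2))

    def recall(self, key, default=None):
        return self.facts.get(key, default)

store_path = Path(tempfile.gettempdir()) / "demo_memory.json"
store_path.unlink(missing_ok=True)  # start clean for the demo

# Session 1: save a durable fact
MemoryStore(store_path).save("preferred_style", "concise, bullet points")

# Session 2: a fresh instance reloads it from disk
memory = MemoryStore(store_path)
print(memory.recall("preferred_style"))  # concise, bullet points
```

The real Memory Tool exposes storage to the model itself; the point here is simply that the facts live outside the context window, so they cost no tokens until recalled.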
Real-World: Customer Service
Compaction in production
A complete walkthrough using Anthropic’s cookbook: 5 support tickets, 35+ tool calls, 208K tokens without compaction vs 86K with. See exactly when compaction triggers, what the summaries contain, and how to configure thresholds, custom prompts, and model selection.
| Metric | No Compaction | With Compaction |
|---|---|---|
| Total turns | 37 | 26 |
| Input tokens | 204,416 | 82,171 |
| Output tokens | 4,422 | 4,275 |
| Total tokens | 208,838 | 86,446 |
| Compactions | N/A | 2 |
| Token savings | — | 122,392 (58.6%) |
```python
# Custom summary prompt for domain needs
compaction_control = {
    "enabled": True,
    "context_token_threshold": 5000,
    "summary_prompt": (
        "Preserve: ticket IDs, categories, "
        "priorities, teams, outcomes. "
        "Discard: full KB articles, draft text."
    ),
}

# Use a cheaper model for summaries
compaction_control = {
    "enabled": True,
    "model": "claude-haiku-4-5",
}
```
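The savings figures in the table above can be sanity-checked directly from the totals:

```python
without = 208_838  # total tokens, no compaction
with_c = 86_446    # total tokens, with compaction

saved = without - with_c
print(f"{saved:,} tokens saved ({saved / without:.1%})")  # 122,392 tokens saved (58.6%)
```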