Token economics.
From prompt to production.

10 lessons — a prompt is instructions made of tokens. Context management is the economy of those tokens. Learn to optimize every one.

1

What Is a Prompt?

Instructions made of tokens

A prompt is an instruction — a series of tokens you send to an LLM. Every character, word, and punctuation mark is tokenized. Understanding this is the foundation: prompts are not magic strings, they are measured, budgeted, and optimized sequences of tokens.

Prompt = Instruction = Tokens

Tokenization: You · are · a · helpful · assistant · .  |  Fix · the · auth · bug · .

System prompt: 6 tokens · User message: 5 tokens · Total: 11 tokens
python
# Every prompt is a series of tokens
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

system = "You are a helpful assistant."
user = "Fix the auth bug."

system_tokens = len(enc.encode(system))
user_tokens = len(enc.encode(user))

print(f"System: {system_tokens} tokens")
print(f"User: {user_tokens} tokens")
print(f"Total: {system_tokens + user_tokens} tokens")
2

Context Window

The finite token budget

Every LLM has a finite context window — the maximum number of tokens it can process in a single request. Think of it as a desk: your prompt (instructions), conversation history, retrieved context, and response space must all fit. When the desk is full, something must go.

Context Window — 128K tokens

System Prompt: 15% · Conversation: 25% · Retrieved Context: 35% · Response Space: 25%

Total: 128,000 tokens · Model: GPT-4o · Cost: ~$0.30/req
python
# Context Window = Your Token Budget
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
WINDOW = 128_000  # max tokens per request

system = "You are a helpful assistant."
system_tokens = len(enc.encode(system))

remaining = WINDOW - system_tokens
print(f"Budget remaining: {remaining:,} tokens")
3

Trimming (Last-N)

Delete the oldest, keep the recent

The simplest token optimization strategy. When the context window fills up, delete the oldest conversation turns and keep only the last N. Like tearing pages from the front of a notebook — fast and predictable, but you lose all early context.

Trimming — Last-N Strategy

Turn 1 · Turn 2 · Turn 3 · Turn 4 · Turn 5  →  deleted (oldest 5)
Turn 6 · Turn 7 · Turn 8  →  kept (last 3)
python
# Trimming — Last-N Strategy
def trim_history(messages, n=3):
    """Keep the system prompt plus the last N turns."""
    system = [m for m in messages if m['role'] == 'system']
    turns = [m for m in messages if m['role'] != 'system']
    return system + turns[-n * 2:]  # each turn = user + assistant message

# e.g. 20 conversation messages → system prompt + last 6 messages (3 turns)
messages = trim_history(conversation, n=3)
print(f"Kept {len(messages)} messages")
4

Summarisation

Condense to save tokens

Instead of deleting old turns, summarise the entire conversation into a compact snapshot. You preserve the big picture but lose verbatim detail. The trade-off: an extra API call (more tokens spent) vs. richer context retention. This is token economy in action.

Summarisation — Snapshot Strategy

Full History (~4,200 tokens):
System prompt
User: setup project
AI: created files...
User: add auth
AI: implemented...
User: fix bug #42
AI: found issue...

Summary (~180 tokens, 96% reduction):
Project initialized with auth module. Bug #42 identified in token validation. Current focus: fixing edge case in refresh flow.
                 Trimming (Last-N)      Summarisation
Speed            Instant                Slow (LLM call)
Token cost       Free                   Extra API call
Early context    Lost completely        Preserved (condensed)
Best for         Simple chatbots        Complex workflows
Risk             Amnesia                Detail loss
python
# Summarisation — Token Economy
import anthropic

def summarise_history(messages, client):
    """Condense the conversation to save tokens on future requests."""
    history_text = "\n".join(
        f"{m['role']}: {m['content']}" for m in messages
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Summarise this conversation:\n{history_text}"
        }]
    )
    return response.content[0].text
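A summary only pays for itself once later requests reuse it. Using the figures from the example above (a ~4,200-token history condensed to ~180 tokens), here is a back-of-envelope sketch; the cost model is deliberately simplified and ignores per-token price differences between input and output:

```python
# Back-of-envelope: when does summarisation beat resending full history?
history_tokens = 4_200   # full conversation history (example above)
summary_tokens = 180     # condensed snapshot

# One extra LLM call: it reads the history and writes the summary
summary_call_cost = history_tokens + summary_tokens

# Every later request sends the summary instead of the full history
saved_per_request = history_tokens - summary_tokens

requests_to_break_even = summary_call_cost / saved_per_request
print(f"Saved per request: {saved_per_request:,} tokens")
print(f"Break-even after ~{requests_to_break_even:.1f} follow-up requests")
```

For long conversations the extra call amortises almost immediately, which is why the trade-off in the table above favours summarisation for complex workflows.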
5

Context Management

Token optimization and economy

Context management is the economy of tokens — deciding what goes into the window and what stays out. You allocate a token budget across system prompt, user message, retrieved context, and response space. Every token has a cost, and every token must earn its place.

Token Budget Allocation
System Prompt12%
User Message8%
Retrieved Files45%
Conversation History20%
Response Reserve15%
85% allocated · 15% free · Warning: near limit
python
# Token Budget Manager
class TokenBudget:
    def __init__(self, limit=128_000):
        self.limit = limit
        self.allocations = {}

    def allocate(self, name, tokens):
        self.allocations[name] = tokens

    @property
    def remaining(self):
        used = sum(self.allocations.values())
        return self.limit - used

    @property
    def utilization(self):
        return sum(self.allocations.values()) / self.limit

budget = TokenBudget()
budget.allocate("system_prompt", 15_360)
budget.allocate("retrieved_files", 57_600)
print(f"{budget.utilization:.0%} used, {budget.remaining:,} tokens left")
6

Context Engineering

IDE-driven, JIT context delivery

Modern IDEs don’t dump everything into the context window. They use just-in-time (JIT) context delivery — pulling in only the files, functions, and docs relevant to the current task. For long-horizon tasks spanning hundreds of tool calls, this intelligent context selection is essential.

JIT Context — Pull Only What You Need
Current file
Open tabs
Import graph
Git diff
Test files
Docs
3 sources active — IDE pulls context just-in-time, not all-at-once
               Dump Everything      JIT Context
Strategy       Send all files       Pull relevant files on demand
Token usage    High (wasteful)      Low (efficient)
Quality        Diluted by noise     Focused signal
Best for       Small projects       Large codebases, long tasks
Example        Paste entire repo    IDE auto-includes imports
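The selection logic can be sketched in a few lines. Everything here is hypothetical (IMPORT_GRAPH, TEST_FILES, and select_context are stand-ins for the graph and heuristics a real IDE computes by parsing the project), but it shows the shape of JIT selection: start from the current file and pull in only what is connected to it.

```python
# Hypothetical JIT context selector. A real IDE builds the import graph
# and test mapping by analysing the codebase; these dicts are stand-ins.
IMPORT_GRAPH = {
    "auth.py": ["tokens.py", "db.py"],
    "tokens.py": ["crypto.py"],
}
TEST_FILES = {"auth.py": "test_auth.py"}

def select_context(current_file, open_tabs):
    """Pull relevant sources just-in-time instead of dumping the repo."""
    selected = {current_file}
    selected.update(IMPORT_GRAPH.get(current_file, []))  # direct imports
    selected.update(open_tabs)                           # what the user sees
    if current_file in TEST_FILES:
        selected.add(TEST_FILES[current_file])           # relevant tests
    return sorted(selected)

context = select_context("auth.py", open_tabs=["db.py"])
print(context)  # a handful of files, not the whole repo
```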
7

Context Pollution

When tokens work against you

Not all tokens are equal. Irrelevant search results, stale tool outputs, and verbose error logs pollute the context — pushing out useful information and confusing the model. In a 200K token window processing 5 tickets, data from Ticket #1 clutters processing of Ticket #5.

Context Pollution — Token Waste

System prompt: 500 tokens
User request: 200 tokens
Stale tool output: 8,200 tokens — wasted
Old KB search: 3,400 tokens — wasted
Current task: 600 tokens
Prev ticket draft: 2,800 tokens — wasted

Useful: 1,300 tokens (8%) · Pollution: 14,400 tokens (92%)
python
# Context Pollution Detection
def detect_pollution(messages):
    """Flag stale tool results that are just burning tokens."""
    stale = []
    for i, msg in enumerate(messages):
        if msg.get('tool_result'):
            age = len(messages) - i
            if age > 10:  # result is older than 10 turns
                stale.append(i)
    print(f"Found {len(stale)} stale entries")
    return stale

# Prune the flagged indices before the next request
stale = detect_pollution(messages)
messages = [m for i, m in enumerate(messages) if i not in stale]
8

Automatic Context Compaction

API-level token optimization

Anthropic’s compaction_control parameter automatically summarizes conversation history when token usage exceeds a threshold. In real-world tests processing 5 customer service tickets: 208K tokens → 86K tokens — a 58.6% reduction, transparently, with no code changes.

Compaction Results — 5 Tickets

Before: 208K tokens, 37 turns
After: 86K tokens, 26 turns

2 compaction events · 58.6% token reduction · 26 turns (vs 37)
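The 58.6% headline follows directly from the exact totals (208,838 tokens without compaction vs 86,446 with, as reported in lesson 10):

```python
# Verifying the reported reduction from the exact token totals
before, after = 208_838, 86_446
saved = before - after
print(f"Saved {saved:,} tokens ({saved / before:.1%})")
```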
python
# Anthropic Automatic Context Compaction
import anthropic

client = anthropic.Anthropic()

runner = client.beta.messages.tool_runner(
    model="claude-sonnet-4-5",
    max_tokens=4096,
    tools=tools,
    messages=messages,
    compaction_control={
        "enabled": True,
        "context_token_threshold": 5000,
    },
)

total_input = total_output = 0
for message in runner:
    total_input += message.usage.input_tokens
    total_output += message.usage.output_tokens
Threshold       When to use                     Compaction frequency
5K–20K          Sequential entity processing    Frequent, minimal accumulation
50K–100K        Multi-phase workflows           Balanced retention
100K–150K       Tasks needing full history      Rare, preserves detail
Default 100K    General long-running tasks      Standard balance
9

Context Editing & Memory Tool

Auto-cleaner + filing cabinet

Context Editing uses a secondary model to remove stale information (the auto-cleaner tidying the desk — up to 84% token reduction). Memory Tool provides persistent external storage (a filing cabinet) that survives across sessions. Together they improve complex task performance by 39%.

Context Editing vs Memory Tool

Context Editing — The Auto-Cleaner: 84% token reduction, in-session only
Memory Tool — The Filing Cabinet: stores prefs, facts, and plans; persists forever, cross-session memory
               Context Editing           Memory Tool
What it does   Removes stale clutter     Saves key facts permanently
Where          On the desk (context)     In the cabinet (external)
Token saving   Up to 84%                 Offloads to storage
Persistence    In-session only           Across all sessions
Combined       39% better on complex tasks (together)
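Anthropic's Memory Tool is exposed through the API; purely as an illustration of the filing-cabinet idea, here is a minimal local sketch backed by a JSON file (the FilingCabinet class, file name, and schema are all hypothetical, not the real tool):

```python
import json
from pathlib import Path

class FilingCabinet:
    """Persistent external memory: survives across sessions, unlike context."""
    def __init__(self, path="memory.json"):
        self.path = Path(path)
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def save(self, category, key, value):
        """File a fact under a category and persist it immediately."""
        self.data.setdefault(category, {})[key] = value
        self.path.write_text(json.dumps(self.data, indent=2))

    def recall(self, category):
        """Read back everything filed under a category."""
        return self.data.get(category, {})

cabinet = FilingCabinet()
cabinet.save("prefs", "language", "en")
cabinet.save("facts", "bug_42", "token validation edge case")
print(cabinet.recall("facts"))
```

A new FilingCabinet pointed at the same file recalls the same facts in a later session, which is exactly what the in-context desk cannot do.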
python
# Context Editing Example
def edit_context(messages, client):
    """Use a secondary model to remove stale info from the context."""
    prompt = "Review this conversation. Remove"
    prompt += " outdated or redundant information."
    prompt += " Keep decisions and current state."

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4000,
        messages=[{
            "role": "user",
            "content": f"{prompt}\n\n{format_msgs(messages)}"
        }]
    )
    # format_msgs / parse_edited_messages are app-specific helpers
    return parse_edited_messages(response.content[0].text)
10

Real-World: Customer Service

Compaction in production

A complete walkthrough using Anthropic’s cookbook: 5 support tickets, 35+ tool calls, 208K tokens without compaction vs 86K with. See exactly when compaction triggers, what the summaries contain, and how to configure thresholds, custom prompts, and model selection.

Customer Service Workflow — 5 Tickets

Per-ticket workflow (7 steps each):
Fetch → Classify → Research → Prioritize → Route → Draft → Complete

Without compaction, token usage grows linearly with each turn, reaching ~204K by the final turn.
Metric          No Compaction    With Compaction
Total turns     37               26
Input tokens    204,416          82,171
Output tokens   4,422            4,275
Total tokens    208,838          86,446
Compactions     N/A              2
Token savings                    122,392 (58.6%)
python
# Custom summary prompt for domain needs
compaction_control = {
    "enabled": True,
    "context_token_threshold": 5000,
    "summary_prompt": (
        "Preserve: ticket IDs, categories, "
        "priorities, teams, outcomes. "
        "Discard: full KB articles, draft text."
    ),
}

# Use a cheaper model for summaries
compaction_control = {
    "enabled": True,
    "model": "claude-haiku-4-5",
}