Token economics.
From prompt to production.
10 lessons — a prompt is instructions made of tokens. Context management is the economy of those tokens. Learn to optimize every one.
What Is a Prompt?
Instructions made of tokens
A prompt is an instruction — a series of tokens you send to an LLM. Every character, word, and punctuation mark is tokenized. Understanding this is the foundation: prompts are not magic strings, they are measured, budgeted, and optimized sequences of tokens.
```python
# Every prompt is a series of tokens
import tiktoken

enc = tiktoken.encoding_for_model("gpt-5.2")

system = "You are a helpful assistant."
user = "Fix the auth bug."

system_tokens = len(enc.encode(system))
user_tokens = len(enc.encode(user))

print(f"System: {system_tokens} tokens")
print(f"User: {user_tokens} tokens")
print(f"Total: {system_tokens + user_tokens} tokens")
```
Context Window
The finite token budget
Every LLM has a finite context window — the maximum number of tokens it can process in a single request. Think of it as a desk: your prompt (instructions), conversation history, retrieved context, and response space must all fit. When the desk is full, something must go.
```python
# Context window = your token budget
import tiktoken

enc = tiktoken.encoding_for_model("gpt-5.2")
WINDOW = 256_000  # max tokens

system = "You are a helpful assistant."
system_tokens = len(enc.encode(system))

remaining = WINDOW - system_tokens
print(f"Budget remaining: {remaining:,} tokens")
```
Trimming (Last-N)
Delete the oldest, keep the recent
The simplest token optimization strategy. When the context window fills up, delete the oldest conversation turns and keep only the last N. Like tearing pages from the front of a notebook — fast and predictable, but you lose all early context.
```python
# Trimming — last-N strategy
def trim_history(messages, n=3):
    """Keep the system prompt plus the last N turns."""
    system = [m for m in messages if m['role'] == 'system']
    turns = [m for m in messages if m['role'] != 'system']
    return system + turns[-n * 2:]  # each turn = user + assistant message

# 20 messages → 6 (last 3 turns)
messages = trim_history(conversation, n=3)
print(f"Kept {len(messages)} messages")
```
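A quick check of the last-N strategy on a synthetic conversation (the helper is repeated here so the snippet runs standalone; the message shapes are illustrative):

```python
def trim_history(messages, n=3):
    """Keep the system prompt plus the last N user/assistant turns."""
    system = [m for m in messages if m['role'] == 'system']
    turns = [m for m in messages if m['role'] != 'system']
    return system + turns[-n * 2:]

# 1 system message + 10 user/assistant turns = 21 messages
conversation = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(10):
    conversation.append({"role": "user", "content": f"question {i}"})
    conversation.append({"role": "assistant", "content": f"answer {i}"})

kept = trim_history(conversation, n=3)
print(len(kept))  # system + 3 turns (6 messages) = 7
```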
Summarisation
Condense to save tokens
Instead of deleting old turns, summarise the entire conversation into a compact snapshot. You preserve the big picture but lose verbatim detail. The trade-off: an extra API call (more tokens spent) vs. richer context retention. This is token economy in action.
| | Trimming (Last-N) | Summarisation |
|---|---|---|
| Speed | Instant | Slow (LLM call) |
| Token cost | Free | Extra API call |
| Early context | Lost completely | Preserved (condensed) |
| Best for | Simple chatbots | Complex workflows |
| Risk | Amnesia | Detail loss |
```python
# Summarisation — token economy
import anthropic

def summarise_history(messages, client):
    """Condense conversation to save tokens."""
    history_text = "\n".join(
        f"{m['role']}: {m['content']}" for m in messages
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Summarise this conversation:\n{history_text}"
        }]
    )
    return response.content[0].text
```
Context Management
Token optimization and economy
Context management is the economy of tokens — deciding what goes into the window and what stays out. You allocate a token budget across system prompt, user message, retrieved context, and response space. Every token has a cost, and every token must earn its place.
```python
# Token budget manager
class TokenBudget:
    def __init__(self, limit=128_000):
        self.limit = limit
        self.allocations = {}

    def allocate(self, name, tokens):
        self.allocations[name] = tokens

    @property
    def remaining(self):
        used = sum(self.allocations.values())
        return self.limit - used

    @property
    def utilization(self):
        return sum(self.allocations.values()) / self.limit
```
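A sketch of allocating the window across its usual consumers (the class is repeated so the snippet runs standalone; the numbers are illustrative, not recommendations):

```python
class TokenBudget:
    def __init__(self, limit=128_000):
        self.limit = limit
        self.allocations = {}

    def allocate(self, name, tokens):
        self.allocations[name] = tokens

    @property
    def remaining(self):
        return self.limit - sum(self.allocations.values())

    @property
    def utilization(self):
        return sum(self.allocations.values()) / self.limit

budget = TokenBudget(limit=128_000)
budget.allocate("system_prompt", 2_000)
budget.allocate("retrieved_context", 40_000)
budget.allocate("history", 30_000)
budget.allocate("response_space", 8_000)

print(f"{budget.remaining:,} tokens free")   # 48,000 tokens free
print(f"{budget.utilization:.1%} utilized")  # 62.5% utilized
```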
Context Engineering
IDE-driven, JIT context delivery
Modern IDEs don’t dump everything into the context window. They use just-in-time (JIT) context delivery — pulling in only the files, functions, and docs relevant to the current task. For long-horizon tasks spanning hundreds of tool calls, this intelligent context selection is essential.
| | Dump Everything | JIT Context |
|---|---|---|
| Strategy | Send all files | Pull relevant files on demand |
| Token usage | High (wasteful) | Low (efficient) |
| Quality | Diluted by noise | Focused signal |
| Best for | Small projects | Large codebases, long tasks |
| Example | Paste entire repo | IDE auto-includes imports |
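The "IDE auto-includes imports" idea can be sketched as a toy selector that pulls in only the active file plus the modules it imports (the repo layout and file contents below are hypothetical):

```python
import re

# Toy repo: path -> source (hypothetical contents)
REPO = {
    "auth.py": "import tokens\nimport db\n\ndef login(): ...",
    "tokens.py": "def issue(): ...",
    "db.py": "def query(): ...",
    "billing.py": "def charge(): ...",  # irrelevant to the current task
}

def jit_context(active_file):
    """Select only the active file plus the modules it imports."""
    selected = {active_file: REPO[active_file]}
    for mod in re.findall(r"^import (\w+)", REPO[active_file], re.M):
        path = f"{mod}.py"
        if path in REPO:
            selected[path] = REPO[path]
    return selected

context = jit_context("auth.py")
print(sorted(context))  # ['auth.py', 'db.py', 'tokens.py'] — billing.py stays out
```

Real IDEs resolve imports with a proper parser rather than a regex, but the economy is the same: focused signal in, noise out.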
Context Pollution
When tokens work against you
Not all tokens are equal. Irrelevant search results, stale tool outputs, and verbose error logs pollute the context — pushing out useful information and confusing the model. In a 200K token window processing 5 tickets, data from Ticket #1 clutters processing of Ticket #5.
```python
# Context pollution detection
def detect_pollution(messages):
    """Flag stale or redundant content."""
    stale = []
    for i, msg in enumerate(messages):
        if msg.get('tool_result'):
            age = len(messages) - i
            if age > 10:  # older than 10 turns
                stale.append(i)
    print(f"Found {len(stale)} stale entries")
    return stale
```
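Detection is only half the job; a follow-up pass can drop what was flagged (a minimal sketch using the same message shape, with the detector repeated so it runs standalone):

```python
def detect_pollution(messages):
    """Flag tool results older than 10 turns."""
    stale = []
    for i, msg in enumerate(messages):
        if msg.get('tool_result') and len(messages) - i > 10:
            stale.append(i)
    return stale

def prune(messages):
    """Drop flagged entries, keeping everything else in order."""
    stale = set(detect_pollution(messages))
    return [m for i, m in enumerate(messages) if i not in stale]

# 15 messages; the tool result at index 0 is 15 turns old, so it is pruned
msgs = [{'tool_result': True}] + [{'content': f'turn {i}'} for i in range(14)]
print(len(prune(msgs)))  # 14
```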
Automatic Context Compaction
API-level token optimization
Anthropic’s compaction_control parameter automatically summarizes conversation history when token usage exceeds a threshold. In real-world tests processing 5 customer service tickets: 208K tokens → 86K tokens — a 58.6% reduction, transparently, with no code changes.
```python
# Anthropic automatic context compaction
import anthropic

client = anthropic.Anthropic()

runner = client.beta.messages.tool_runner(
    model="claude-sonnet-4-5",
    max_tokens=4096,
    tools=tools,
    messages=messages,
    compaction_control={
        "enabled": True,
        "context_token_threshold": 5000,
    },
)

total_input = total_output = 0
for message in runner:
    total_input += message.usage.input_tokens
    total_output += message.usage.output_tokens
```
| Threshold | When to use | Compaction frequency |
|---|---|---|
| 5K–20K | Sequential entity processing | Frequent, minimal accumulation |
| 50K–100K | Multi-phase workflows | Balanced retention |
| 100K–150K | Tasks needing full history | Rare, preserves detail |
| Default 100K | General long-running tasks | Standard balance |
Context Editing & Memory Tool
Auto-cleaner + filing cabinet
Context Editing uses a secondary model to remove stale information (the auto-cleaner tidying the desk — up to 84% token reduction). Memory Tool provides persistent external storage (a filing cabinet) that survives across sessions. Together they improve complex task performance by 39%.
| | Context Editing | Memory Tool |
|---|---|---|
| What it does | Removes stale clutter | Saves key facts permanently |
| Where | On the desk (context) | In the cabinet (external) |
| Token saving | Up to 84% | Offloads to storage |
| Persistence | In-session only | Across all sessions |
| Combined | 39% better on complex tasks | 39% better on complex tasks |
```python
# Context editing example
def edit_context(messages, client):
    """Remove stale info from context."""
    prompt = (
        "Review this conversation. Remove"
        " outdated or redundant information."
        " Keep decisions and current state."
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4000,
        messages=[{
            "role": "user",
            "content": f"{prompt}\n\n{format_msgs(messages)}"
        }]
    )
    return parse_edited_messages(response.content[0].text)
```
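The Memory Tool side, the filing cabinet, can be sketched as a simple file-backed store that persists across sessions. This is a minimal stand-in to illustrate the idea, not the actual tool's API:

```python
import json
import tempfile
from pathlib import Path

class MemoryStore:
    """A file-backed 'filing cabinet' that survives across sessions."""

    def __init__(self, path):
        self.path = Path(path)
        self.facts = json.loads(self.path.read_text()) if self.path.exists() else {}

    def save(self, key, value):
        self.facts[key] = value
        self.path.write_text(json.dumps(self.facts, indent=2))

    def recall(self, key, default=None):
        return self.facts.get(key, default)

store_path = Path(tempfile.gettempdir()) / "demo_memory.json"
store_path.unlink(missing_ok=True)  # start clean for the demo

# Session 1: save a durable fact
MemoryStore(store_path).save("preferred_style", "concise, bullet points")

# Session 2: a fresh instance reloads it from disk
memory = MemoryStore(store_path)
print(memory.recall("preferred_style"))  # concise, bullet points
```

The real Memory Tool exposes storage to the model itself; the point here is simply that the facts live outside the context window, so they cost no tokens until recalled.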
Real-World: Customer Service
Compaction in production
A complete walkthrough using Anthropic’s cookbook: 5 support tickets, 35+ tool calls, 208K tokens without compaction vs 86K with. See exactly when compaction triggers, what the summaries contain, and how to configure thresholds, custom prompts, and model selection.
| Metric | No Compaction | With Compaction |
|---|---|---|
| Total turns | 37 | 26 |
| Input tokens | 204,416 | 82,171 |
| Output tokens | 4,422 | 4,275 |
| Total tokens | 208,838 | 86,446 |
| Compactions | N/A | 2 |
| Token savings | — | 122,392 (58.6%) |
```python
# Custom summary prompt for domain needs
compaction_control = {
    "enabled": True,
    "context_token_threshold": 5000,
    "summary_prompt": (
        "Preserve: ticket IDs, categories, "
        "priorities, teams, outcomes. "
        "Discard: full KB articles, draft text."
    ),
}

# Use a cheaper model for summaries
compaction_control = {
    "enabled": True,
    "model": "claude-haiku-4-5",
}
```
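The savings figures in the table above can be sanity-checked directly from the totals:

```python
without = 208_838  # total tokens, no compaction
with_c = 86_446    # total tokens, with compaction

saved = without - with_c
print(f"{saved:,} tokens saved ({saved / without:.1%})")  # 122,392 tokens saved (58.6%)
```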