Dictionary
50 terms. Key terms and concepts in context engineering, token economics, and AI-assisted development.
Agent Hooks
(architecture) Deterministic shell commands that run at key lifecycle points during agent sessions (e.g., PreToolUse, PostToolUse, SessionStart, SessionStop). Unlike instructions that guide behavior, hooks execute code with guaranteed outcomes — useful for enforcing security policies, automating code quality checks, or creating audit trails.
Agent Orchestration
(architecture) A pattern where multiple specialized subagents collaborate to achieve a complex goal, each operating in its own dedicated context window. Enables context efficiency (no shared overflow), specialization (different models per task), and parallel execution of independent subtasks. Examples include conductor-based systems with planning, implementation, and review agents.
Agent Skills
(architecture) Reusable packages of domain expertise defined in SKILL.md files that provide specialized capabilities, knowledge, and refined workflows to AI agents. Skills can be scoped to specific domains like testing strategies, API design, or performance optimization, and can be invoked as slash commands or loaded automatically by the model.
Attention Budget
(context management) The effective amount of context a model can meaningfully attend to. While context windows may be large (128K+), attention quality degrades with length. The practical attention budget is often smaller than the raw token limit.
Automatic Context Compaction
(memory strategies) An API-level feature (e.g., Anthropic’s compaction_control parameter) that automatically summarizes conversation history when token usage exceeds a configurable threshold. Can achieve 50-60% token reduction transparently, without application-level code changes.
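To make the mechanics concrete, here is a client-side sketch of the same threshold-and-summarize behavior (the API feature does this server-side; count_tokens and summarize are placeholder callables, not a real provider interface):

```python
def maybe_compact(messages: list[dict], count_tokens, summarize,
                  threshold: int = 100_000) -> list[dict]:
    """If total tokens exceed the threshold, replace older turns with a
    summary and keep the most recent turns verbatim."""
    if count_tokens(messages) <= threshold:
        return messages                          # under budget, nothing to do
    head, tail = messages[:-6], messages[-6:]    # keep the last 3 turns verbatim
    summary = summarize(head)                    # an LLM call in practice
    return [{"role": "system",
             "content": f"Summary of earlier conversation: {summary}"}] + tail
```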
Batch Processing
(token economics) An API tier that processes requests asynchronously at reduced cost, ideal for non-time-sensitive workloads like bulk analysis or testing. Typically offers the lowest per-token pricing.
Break-Even Point
(caching) The number of requests after which prompt caching becomes cost-effective compared to standard pricing. Calculated from cache write cost vs. cumulative savings from cache reads.
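A quick worked example, using the typical multipliers cited in the next two entries (a 25% write premium, a 90% read discount; the figures are illustrative, not provider-specific):

```python
BASE = 3.00          # $/MTok standard input price (example figure)
WRITE_MULT = 1.25    # first request pays ~25% extra to populate the cache
READ_MULT = 0.10     # later requests read the prefix at a 90% discount

def prefix_cost(requests: int, cached: bool, prefix_mtok: float = 0.05) -> float:
    """Total cost of the shared prompt prefix over N requests."""
    if not cached:
        return requests * BASE * prefix_mtok
    # One cache write, then (N - 1) cache reads.
    return (WRITE_MULT + (requests - 1) * READ_MULT) * BASE * prefix_mtok

for n in (1, 2, 10):
    print(n, prefix_cost(n, cached=False), round(prefix_cost(n, cached=True), 4))
# Break-even on the 2nd request: 1.25 + 0.10 < 2.00 (in BASE-price units).
```

With these multipliers, caching already pays off on the second reuse of the prefix.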
Cache Read
(caching) The cost of using a previously cached prompt prefix. Significantly cheaper than standard input (typically a 90% discount). The break-even point depends on how many times the cached prefix is reused.
Cache Write
(caching) The cost of storing a prompt prefix in the cache for the first time. Typically 25% more expensive than standard input pricing, but pays for itself when the same prefix is reused multiple times.
Chain of Thought (CoT)
(prompt engineering) A prompting technique where the model is instructed to show its reasoning step by step before giving a final answer. Improves accuracy on complex tasks but increases output token usage.
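A minimal illustration of the phrasing (the exact wording is a common convention, not a fixed API):

```python
question = "A train leaves at 9:40 and arrives at 11:05. How long is the trip?"
prompt = (
    "Think through this step by step, showing your reasoning. "
    "Then give the final answer on its own line.\n\n" + question
)
# The visible reasoning improves accuracy on multi-step problems,
# at the cost of extra output tokens.
```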
Cobb-Douglas Model
(evaluation) An economic production function adapted for AI cost analysis. Models the relationship between token inputs (quantity and quality) and output quality, helping find the optimal trade-off between cost and performance.
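A minimal sketch of the functional form, adapted from the classic Y = A · K^α · L^β (the exponents and the two-factor split into quantity and signal are illustrative assumptions, not a published calibration):

```python
def predicted_quality(token_quantity: float, signal_quality: float,
                      A: float = 1.0, alpha: float = 0.3, beta: float = 0.7) -> float:
    """Cobb-Douglas-style production function for context:
    quality ~ A * quantity^alpha * signal^beta (illustrative exponents)."""
    return A * (token_quantity ** alpha) * (signal_quality ** beta)

# With beta > alpha, improving signal quality raises output quality more
# than adding raw tokens does, which matches the high-signal-tokens thesis.
```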
Compaction
(context management) When a session nears its token limit, the assistant summarizes critical details — such as architectural decisions and unresolved bugs — while discarding redundant tool outputs. This reclaims context budget without losing essential information.
Context Editing
(memory strategies) An advanced context management technique where a secondary model reviews and removes stale or redundant information from the conversation before the next turn. Like an auto-cleaner that tidies the desk. Can achieve up to 84% token reduction while maintaining coherence.
Context Engineering
(context management) The discipline of designing and managing the information provided to an AI model to maximize output quality while minimizing token costs. Encompasses prompt design, file selection, caching strategy, and context window management.
Context Pollution
(context management) The accumulation of irrelevant, redundant, or misleading information in the context window that degrades model performance. Includes distractors, context rot, and poor structural patterns. Fighting context pollution is a core challenge of context engineering.
Context Rot
(context management) As more tokens are added to a conversation, the model’s ability to accurately retrieve specific pieces of information from the context decreases. Long conversations suffer from degraded attention, making early details harder to recall.
Context Token Threshold
(memory strategies) A configurable token count (typically between 5,000 and 150,000) at which automatic context compaction is triggered. When conversation tokens exceed this threshold, the system summarizes older turns to reclaim budget.
Context Window
(context management) The maximum number of tokens (input + output) a model can process in a single request. Think of it as the AI’s short-term memory or desk — everything must fit on the desk at once. Ranges from 8K to 2M tokens depending on the model.
Copilot Memory
(memory strategies) A persistent cross-session memory store for GitHub Copilot that lets the agent save and recall important information across chat sessions. The agent recognizes when to store facts (e.g., 'always ask clarifying questions') and retrieves relevant memories to inform future responses, eliminating the need to repeatedly provide the same context.
Cost per Million Tokens (MTok)
(token economics) Standard pricing unit for LLM APIs. For example, Claude Sonnet 4.5 costs $3/MTok input and $15/MTok output. This metric allows comparison across providers and models.
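At the example rates above, per-request cost is simple arithmetic:

```python
INPUT_PER_MTOK, OUTPUT_PER_MTOK = 3.00, 15.00   # example Sonnet 4.5 rates from above

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the given per-MTok rates."""
    return (input_tokens * INPUT_PER_MTOK
            + output_tokens * OUTPUT_PER_MTOK) / 1_000_000

print(request_cost(10_000, 2_000))   # 10K in + 2K out -> $0.03 + $0.03 = $0.06
```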
Distractors
(context management) Files or code snippets that are topically related to the query but do not contain the answer. These can cause the model to lose focus or hallucinate, degrading output quality. Effective context engineering actively filters distractors.
Extended Thinking
(prompt engineering) A model capability where additional compute is used for internal reasoning before generating a response. The thinking tokens consume budget but can dramatically improve quality on hard problems.
Function Calling / Tool Use
(architecture) The ability of a model to invoke external functions or tools during generation. The model outputs structured parameters that the client executes, returning results back into the context. Enables agentic workflows.
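A provider-agnostic sketch of the loop (the reply shape and call_model are placeholders; real APIs define tools with JSON Schema and use their own message formats):

```python
import json

# Hypothetical tool registry; real APIs take JSON Schema tool definitions.
TOOLS = {"get_weather": lambda city: json.dumps({"city": city, "temp_c": 18})}

def run_turn(call_model, messages: list[dict]) -> str:
    """Loop until the model answers in text instead of requesting a tool."""
    while True:
        reply = call_model(messages)        # placeholder for the real API call
        if reply["type"] == "text":
            return reply["content"]         # final answer reached
        # The model emitted structured parameters: execute and feed back.
        result = TOOLS[reply["name"]](**reply["arguments"])
        messages.append({"role": "tool", "name": reply["name"], "content": result})
```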
High-Signal Tokens
(token economics) Tokens that contribute meaningfully to the model’s understanding of the task. The objective of context engineering is to provide the smallest possible set of high-signal tokens that maximizes the likelihood of correct code generation.
Input Tokens
(token economics) Tokens sent to the model in a request, including system prompts, conversation history, and user messages. Input tokens are typically cheaper than output tokens and form the bulk of context window usage.
JIT Context (Just-in-Time)
(context management) A strategy where context is loaded dynamically at runtime rather than pre-loaded. The IDE or agent fetches only the files, symbols, or data needed for the current step — similar to how modern IDEs lazy-load imports. Reduces waste and keeps the context window focused.
Lightweight Identifiers
(context management) The assistant maintains references (file paths, stored queries) and dynamically loads only the necessary data at runtime using tools like grep, head, or tail. This avoids stuffing the full content into the context window upfront.
MCP Apps
(architecture) Rich, interactive UI components rendered by MCP (Model Context Protocol) servers directly in the chat client. Enables models to display visualizations like flame graphs, flowcharts, and data dashboards inline, providing visual context alongside text-based responses.
Memory Tool
(memory strategies) An external persistent storage mechanism (like a filing cabinet) that the model can read from and write to across sessions. Unlike the context window (short-term desk), the memory tool persists information permanently. Used for user preferences, project knowledge, and cross-session continuity.
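A minimal file-backed sketch of the pattern (the storage layout and function names are assumptions for illustration; real memory tools expose their own commands to the model):

```python
import json
from pathlib import Path

MEMORY_FILE = Path("memories.json")   # illustrative: any persistent store works

def remember(key: str, value: str) -> None:
    """Write a fact that should survive the current session."""
    data = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    data[key] = value
    MEMORY_FILE.write_text(json.dumps(data, indent=2))

def recall(key: str, default: str | None = None) -> str | None:
    """Read a remembered fact back in a later session."""
    if MEMORY_FILE.exists():
        return json.loads(MEMORY_FILE.read_text()).get(key, default)
    return default

remember("user.preference", "always ask clarifying questions")
print(recall("user.preference"))
```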
Message Steering
(context management) An agent interaction pattern where a follow-up message signals the currently running request to yield after finishing the active tool execution, then processes the new message immediately. Used to redirect an agent heading in the wrong direction without cancelling or waiting for the full response to complete.
Output Tokens
(token economics) Tokens generated by the model in response. Output tokens are typically 3-5x more expensive than input tokens. Controlling output length through instructions and max_tokens is a key cost optimization lever.
Progressive Disclosure
(context management) Instead of loading an entire codebase — which would immediately overwhelm the attention budget — modern agents use JIT (just-in-time) context. The assistant dynamically loads only the necessary data at runtime, revealing information progressively as needed.
Prompt
(prompt engineering) The complete set of instructions and context sent to an AI model in a single request. Includes system instructions, user message, conversation history, and any retrieved context. The quality of the prompt directly determines the quality of the output.
Prompt Caching
(caching) A technique where frequently-used prompt prefixes are stored server-side, allowing subsequent requests with the same prefix to be processed at reduced cost (typically 90% cheaper). Requires a minimum token threshold to activate.
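An illustrative request body in the style of Anthropic's prompt caching, where a cache_control field marks the prefix to cache (check current provider docs for exact field names, minimum sizes, and supported models):

```python
LONG_STABLE_SYSTEM_PROMPT = "You are a meticulous code reviewer. " * 200
# The stable prefix must exceed the provider's minimum cacheable size.

request = {
    "model": "claude-sonnet-4-5",                 # example model id
    "max_tokens": 1024,
    "system": [{
        "type": "text",
        "text": LONG_STABLE_SYSTEM_PROMPT,        # identical across requests
        "cache_control": {"type": "ephemeral"},   # cache breakpoint
    }],
    "messages": [{"role": "user", "content": "Review this diff..."}],
}
```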
RAG (Retrieval-Augmented Generation)
(architecture) A pattern where relevant documents are retrieved from a knowledge base and injected into the context before generation. Enables models to reference up-to-date or domain-specific information without fine-tuning.
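A toy sketch of the pattern (keyword scoring stands in for the embedding search a production retriever would use):

```python
def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Rank documents by crude keyword overlap and return the top k."""
    words = query.lower().split()
    scored = sorted(corpus.items(),
                    key=lambda kv: sum(w in kv[1].lower() for w in words),
                    reverse=True)
    return [text for _, text in scored[:k]]

def build_prompt(query: str, corpus: dict[str, str]) -> str:
    """Inject the retrieved documents into the context before generation."""
    docs = "\n\n".join(retrieve(query, corpus))
    return f"Answer using only the documents below.\n\n{docs}\n\nQuestion: {query}"
```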
Structural Patterns
(context management) The placement and organization of information within the context window. Research finds that models often perform differently on shuffled vs. logically structured context: structure affects retrieval accuracy and generation quality.
Structured Note-taking
(context management) The agent maintains an external NOTES.md or to-do list to track dependencies and progress across thousands of steps. After a context reset, it can read these notes back to restore essential state without replaying the full history.
Subagent
(architecture) An isolated agent invoked by a parent agent to handle a subtask in its own dedicated context window. Subagents prevent context overflow in the main agent by running independently, and can execute in parallel for tasks that can be split into independent parts.
Summarization
(memory strategies) A context management strategy that condenses the full conversation history into a compact executive summary before each new turn. Preserves the big picture but loses verbatim detail and adds latency from the summarization LLM call.
Summary Prompt
(memory strategies) A custom instruction given to the compaction system that controls how conversation history is summarized. Allows domain-specific summarization — for example, a customer service bot can be told to always preserve order IDs, account numbers, and resolution status in the summary.
System Prompt
(prompt engineering) Instructions provided at the beginning of a conversation to set the model’s behavior, personality, and constraints. System prompts are ideal candidates for caching since they remain constant across requests.
Tab Relevance Scoring
(architecture) A technique used in IDE extensions to rank open editor tabs by their relevance to the current task. Factors include import relationships, path similarity, edit recency, and diagnostic overlap.
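A sketch of such a scorer over the factors listed above (the weights and factor encodings are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class Tab:
    path: str
    imports_current_file: bool   # import relationship with the active file
    seconds_since_edit: float    # edit recency
    shared_diagnostics: int      # diagnostics overlapping the active file

def relevance(tab: Tab, current_path: str) -> float:
    """Weighted sum of the relevance factors; weights are illustrative."""
    path_similarity = len(set(tab.path.split("/")) & set(current_path.split("/")))
    recency = 1.0 / (1.0 + tab.seconds_since_edit / 60.0)
    return (2.0 * tab.imports_current_file + 1.0 * path_similarity
            + 1.5 * recency + 0.5 * tab.shared_diagnostics)
```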
Terminal Sandboxing
(architecture) A security mechanism that restricts file system and network access for terminal commands executed by AI agents. Sandboxed commands have read/write access only to the current workspace, and network access can be limited to trusted domains. Helps mitigate risks from agent-executed commands.
Thinking Tokens
(token economics) Tokens generated by a model's internal reasoning process before producing a visible response. Thinking tokens consume context budget but can dramatically improve quality on complex tasks. Anthropic models now support interleaved thinking between tool calls, and the thinking budget is configurable.
Token
(token economics) The fundamental unit of text processing for LLMs. A token is roughly 3-4 characters or 0.75 words in English. All API pricing is based on token counts. Understanding tokenization is essential for cost estimation.
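The heuristic is easy to check empirically, for example with OpenAI's tiktoken library (tokenizers differ between vendors, so counts are model-specific):

```python
import tiktoken   # pip install tiktoken; other vendors tokenize differently

enc = tiktoken.get_encoding("cl100k_base")
text = "Context engineering is the discipline of managing model input."
tokens = enc.encode(text)
print(len(tokens), len(text) / len(tokens))   # token count and chars per token
```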
Token Budget Dashboard
(architecture) A real-time visualization showing current token usage across the context window, broken down by category (system prompt, conversation history, file contents, tool outputs). Helps developers stay within limits.
Tool Result Clearing
(context management) A lighter form of compaction where the raw results of previous tool calls (like long terminal outputs or file reads) are cleared to save space, while keeping the conclusions and decisions derived from them.
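A sketch of the idea on a generic message list (the message shape is an assumption; keep_last must be at least 1):

```python
def clear_old_tool_results(messages: list[dict], keep_last: int = 2) -> list[dict]:
    """Stub out all but the most recent tool results, preserving every
    assistant message (where conclusions and decisions live)."""
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    for i in tool_indices[:-keep_last]:
        messages[i] = {**messages[i],
                       "content": "[tool output cleared to save context]"}
    return messages
```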
Trimming (Last-N)
(memory strategies) A simple context management strategy that keeps only the last N conversation turns and discards older ones. Fast and predictable but loses early context completely. Best for simple chatbots where recent context matters most.
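The strategy fits in a few lines (assuming a simple role-tagged message list):

```python
def trim_last_n(messages: list[dict], n_turns: int) -> list[dict]:
    """Keep the system prompt plus only the last N user/assistant turns."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-2 * n_turns:]   # one turn = user message + reply
```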
Turn Limit
(memory strategies) A configuration parameter that triggers automatic context compaction after a specified number of conversation turns, regardless of token count. Provides predictable compaction intervals for applications with consistent turn sizes.
XML Tagging
(prompt engineering) Using tags like <background_information>, <tool_guidance>, <constraints> to clearly separate different types of instructions in prompts. This structural technique helps models parse complex multi-section prompts more reliably.
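For example (section contents are invented for illustration):

```python
prompt = """
<background_information>
The service is a Flask API backed by Postgres.
</background_information>

<tool_guidance>
Prefer read-only queries; never run migrations without confirmation.
</tool_guidance>

<constraints>
Keep answers under 200 words. Cite a file path for every claim.
</constraints>

Task: explain why the /users endpoint returns 500 under load.
"""
```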