Three compression strategies for the instructions that tell your AI assistant how to think.
You've spent time wiring up an AI coding assistant. You've configured it with behavioral rules: initialize from a memory database, check governance decisions before touching the codebase, log what you did before calling it done. Your instructions are thorough. They're precise. They handle edge cases.
They're also enormous—and they're running up a tab every time your assistant opens its eyes.
This is the hidden cost that most teams building with AI agents eventually stumble into: the instructions for how the agent should behave are themselves consuming the space the agent needs to actually do the work. It's a bit like hiring an expert and giving them such a detailed employee handbook that they spend the whole morning reading it before they can help a single customer.
This post is about three strategies I developed while working on Engrams that collectively cut behavioral instruction overhead by roughly 80–90%, without reducing how well agents follow those instructions.
A Quick Word on How This All Works
Before we dive in, a one-paragraph primer for anyone new to this space.
Tools like Engrams work through something called the Model Context Protocol (MCP) — a standard that lets AI assistants (like the ones built into Cursor, Windsurf, Roo Code, or Claude) call external tools the same way a web browser calls APIs. When you open a project, your AI assistant gets loaded up with a set of instructions: "here's who you are, here's how you behave, here are the tools you can call." Those instructions share a fixed space, called the context window, with everything else the model sees, and that space has a size limit: typically around 200,000 tokens (roughly 150,000 words). Every token of instructions you add is one fewer token available for actual code, files, or conversation.
The instructions we're talking about compressing are the behavioral rules that tell the agent how to use the memory tools: when to load context, when to run governance checks, when to log decisions. These rules, in their original form, are long. Very long.
The Problem, in Numbers
The Engrams system ships with a strategy file — a set of behavioral instructions — that agents load at the start of every session. It's structured in YAML (a format popular for configuration files), and it's thorough.
Maybe too thorough.
Each version of the file runs 375–415 lines. And because Engrams supports seven different AI tools (Cursor, Windsurf, Roo Code, Claude Code, Cline, and more), there are seven copies of this file — one customized for each tool's quirks.
These seven files are roughly 90% identical. The differences between them amount to a few lines of tool-specific wording; the rest is the same behavioral logic, repeated seven times.
In token terms: each file consumes about 10–12K tokens. On a 200K context window, that's 5–6% of your available space consumed before the user types a single word. Then add the tool schema definitions (also injected at startup), and the results from the initialization tool calls, and you're looking at 25–30K tokens burned on overhead before any productive work begins.
Think of it like renting a storage unit and using the first 15% just to store the shelving units.
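To make that overhead concrete, here is the back-of-envelope math in Python. The strategy-file figure comes from the numbers above; the schema and initialization figures are assumptions chosen to land in the 25–30K range quoted above, not measured Engrams values.

```python
# Back-of-envelope session overhead. The strategy-file figure is from the
# post; the other two are assumptions, not measured Engrams values.
CONTEXT_WINDOW = 200_000   # typical context window, in tokens

strategy_file = 11_000     # one strategy file, ~10-12K tokens
tool_schemas = 8_000       # assumed: tool schema definitions injected at startup
init_results = 9_000       # assumed: results of the 7-8 initialization calls

overhead = strategy_file + tool_schemas + init_results
print(f"Overhead: {overhead:,} tokens ({overhead / CONTEXT_WINDOW:.0%} of the window)")
# Prints: Overhead: 28,000 tokens (14% of the window)
```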
Strategy 1: Teach on Demand, Not All at Once
The concept: Replace the 400-line instruction file with a short table of contents. Only load the full details for a section when the agent actually needs it.
This is the "just-in-time" approach — a concept borrowed from manufacturing, where you don't stock warehouse shelves with inventory until a customer actually orders something.
In practice, the 400-line strategy file gets replaced with a 30-line bootstrap stub that looks something like this:
ENGRAMS STRATEGY — Section Directory

INIT: Check for engrams/context.db. If found → load contexts. If not → offer setup.
GOVERNANCE: Before mutating workspace → call get_relevant_context(task_description, budget=2000).
POST_TASK: Before completion → run checks, log outcomes.
  Detail: get_custom_data("engrams_strategy", "post_task")
SYNC: On "Sync Engrams" → pause, review chat, log all changes.
  Detail: get_custom_data("engrams_strategy", "sync")
LINKING: Proactively suggest links between discussed Engrams items.
  Detail: get_custom_data("engrams_strategy", "linking")
Notice the Detail: pointers. For simple behaviors (like the initialization check), a one-line summary is enough — the agent doesn't need a five-page explanation to check whether a file exists. For complex, multi-step workflows (like the post-task completion gate), the agent can fetch the full instructions from the server only when that workflow is actually triggered.
The full strategy sections are stored as built-in defaults inside the MCP server itself, not even in the database. They're shipped with the code and returned on demand — no database round-trip needed for the initial fetch.
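The serve-on-demand side can be sketched in a few lines of Python. This is a hypothetical illustration, not Engrams source: the section names match the directory above, but the section text and the fallback behavior are placeholders.

```python
# Hypothetical sketch of on-demand strategy sections. Full sections ship as
# built-in defaults inside the server; the text below is placeholder, not
# Engrams' actual instructions.
BUILTIN_STRATEGY_SECTIONS = {
    "post_task": "Before completion: run configured verification checks, log outcomes...",
    "sync": "On 'Sync Engrams': pause, review the chat, log all changes...",
    "linking": "Proactively suggest links between discussed Engrams items...",
}

def get_custom_data(category: str, key: str) -> str:
    """Return a strategy section from built-in defaults (no database round-trip)."""
    if category == "engrams_strategy":
        return BUILTIN_STRATEGY_SECTIONS[key]
    # In a real server, other categories would fall through to the database.
    raise NotImplementedError(f"category not sketched: {category}")

# The agent pays for a section's tokens only when its workflow triggers:
post_task_rules = get_custom_data("engrams_strategy", "post_task")
```

The bootstrap stub never embeds these strings; it only carries the pointers, so a session that never syncs never pays for the sync section.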
What this saves:
- System prompt drops from ~10,000 tokens to ~800 tokens (the bootstrap + directory)
- Per-session, only the sections that actually trigger get loaded: roughly 1,000–3,000 tokens total
- Effective reduction: 80–90% in a typical session
The trade-off is one extra tool call when a new strategy section is first needed. But you're already making 7–8 initialization calls per session; a few more for on-demand strategy sections barely registers.
Strategy 2: Say What You Mean, Not Everything You Know
The concept: LLMs understand terse instructions just as well as verbose ones. Cut the scaffolding, not the substance.
Here's a real example. The original YAML for one section — the first-time project setup questionnaire — runs 50 lines:
post_task_setup_questionnaire:
  description: |
    Runs during first-time Engrams setup to configure project-specific verification checks
    that will be automatically executed as part of the post_task_completion gate.
    The agent auto-detects the project type and suggests relevant checks.
  trigger: "During handle_new_engrams_setup, after projectBrief import..."
  steps:
    - step: 1
      action: "Auto-detect project type by scanning workspace root for manifest files."
      detection_rules:
        - indicator: "package.json"
          suggests: ["npm test", "npm run lint", "npm run build", "npx tsc --noEmit"]
          project_type: "Node.js/TypeScript"
        - indicator: "pyproject.toml OR requirements.txt OR setup.py"
          suggests: ["pytest", "flake8 .", "mypy .", "python -m py_compile"]
          project_type: "Python"
    ... (5 more language blocks)
Here's the compressed equivalent — 12 lines:
POST_TASK_SETUP (first-time only, after DB init):
  1. Detect project type from manifest:
     package.json→Node, pyproject.toml/requirements.txt→Python,
     Cargo.toml→Rust, go.mod→Go, pom.xml/build.gradle→Java,
     Gemfile→Ruby, composer.json→PHP
  2. Suggest test/lint/build checks for detected type
  3. Let user toggle/customize checks, set severity (blocking|warning)
  4. Store → log_custom_data(category="post_task_checks", key="verification_commands")
  5. Record → log_decision with tags [post-task-checks, project-setup, dx]
The information content is identical. The model gets the same instruction either way. What changed is the packaging.
A few specific patterns that bloat AI instruction files:
Empty thinking_preamble blocks. The original file had eight of these — YAML keys with a pipe character (|) and no content beneath them. They were placeholders, never filled in. Each one still consumes tokens. Removing them cost nothing and saved hundreds of tokens.
YAML structure applied to non-hierarchical data. Detection rules for project types are a flat lookup table. Representing them as nested YAML objects with indicator, suggests, and project_type keys for each entry uses roughly 4× the tokens of a simple colon-separated mapping on a single line.
Step-by-step descriptions of things the model already knows. "Auto-detect project type by scanning workspace root for manifest files" is a complete sentence that means exactly what "Detect project type from manifest" means. The longer version doesn't add information — it adds tokens.
The analogy here is the difference between a recipe that says "In a large mixing bowl, which should be clean and dry, add the flour" and one that just says "Add flour." A trained chef doesn't need the bowl specification. Neither does a trained model.
Strategy 3: Stop Maintaining Seven Copies of the Same Thing
The concept: Keep one authoritative source, generate the rest.
The seven strategy files for different AI tools are nearly identical. The differences are cosmetic: the name of the tool in the header comment, the variable name for the workspace path, and one or two tool-specific function names.
Instead of maintaining seven full files, the architecture shifts to:
- One `_core_strategy.yaml` containing the shared ~95% of content
- Seven thin `_delta_*.yaml` files containing only the tool-specific overrides:
# _delta_roo.yaml
header_comment: "# Install: Roo Code loads rules from .roo/rules/"
workspace_id_source: "Available as ${workspaceFolder}..."
tool_specific_references:
  ask_question: "Use `ask_followup_question`"
  execute_cmd: "Use `execute_command`"
At build time (or install time), the delta gets merged with the core to produce the final strategy file for each tool. The output is the same — users get a single, complete file for their tool — but the maintenance surface shrinks dramatically.
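A minimal sketch of that merge step, using plain dicts to stand in for parsed YAML and assuming deltas only override or add top-level keys (the real build step may be more involved):

```python
# Build-time merge: delta keys win, everything else comes from the shared core.
# Plain dicts stand in for parsed YAML; keys mirror the delta example above.
core = {
    "header_comment": "# Generic install instructions",
    "workspace_id_source": "Detect from the environment",
    "governance": "Before mutating workspace -> get_relevant_context(...)",
}

delta_roo = {  # _delta_roo.yaml, as a dict
    "header_comment": "# Install: Roo Code loads rules from .roo/rules/",
    "workspace_id_source": "Available as ${workspaceFolder}...",
}

def merge(core: dict, delta: dict) -> dict:
    """Produce one tool's final strategy: shared core plus thin overrides."""
    return {**core, **delta}

strategy_roo = merge(core, delta_roo)
```

Updating the governance wording once in the core now flows into all seven generated files; only genuinely tool-specific lines live in the deltas.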
This matters for a less obvious reason: it makes Strategy 2 (semantic compression) much easier to apply. When you have seven copies, compressing one means compressing seven — and keeping them in sync as the compression evolves. With a single core file, you apply the compression once and all seven outputs benefit automatically.
It's the same principle behind any shared library or abstraction in software development. Stop repeating yourself so you can stop making the same mistake seven times.
What this saves: Nothing on per-user token cost — the output file is the same size. But it cuts maintenance cost dramatically and creates a compounding leverage point for the other strategies.
How the Strategies Stack
These three strategies aren't independent — they compound.
Start with the deduplicated architecture (Strategy 3): you now have one source of truth for the behavioral instructions. Apply semantic compression (Strategy 2) to that source: the compressed version is shorter and clearer. Then deploy the just-in-time loading approach (Strategy 1): instead of injecting the compressed file at startup, serve it from the server on demand, section by section.
The result is a system where:
- The cold-start cost (tokens consumed before any user interaction) drops from ~10K to ~800
- The total per-session cost depends on which workflows actually trigger, not on the worst-case size of the full instruction set
- The maintenance cost of updating behavioral instructions drops to editing one file
In concrete terms: a developer opening a project at the start of a workday used to "spend" 10,000 tokens on behavioral instructions before their AI assistant could respond to their first message. After applying all three strategies, that cost is 800 tokens. The agent gets faster access to the instructions that matter and has more context window available for the actual codebase.
The Bigger Lesson
These strategies are specific to AI agent behavioral instructions, but the underlying principles apply anywhere you're shipping configuration or instructions that need to be compact and maintainable:
- Don't load everything upfront. Load what you need, when you need it. This is just lazy evaluation, applied to prompts.
- Don't say more than you need to. Structural formatting and verbose phrasing are free in human-written documentation, but expensive in machine-consumed prompts. Write for the model, not for the formatting linter.
- Don't repeat yourself across seven files. A shared source with thin per-variant deltas is almost always better than N full copies — for prompts, for config, for anything.
The irony of a memory tool that consumes too much memory to be useful isn't lost on me. These strategies exist because we hit that irony directly and had to work our way out of it.
If you're building with AI agents at any scale, chances are you'll hit it too.
Engrams is an open-source MCP server for persistent AI agent memory and team governance. Source and documentation at engram.sh.