Tokens, Context Windows, and Why Your AI Agent Feels Stupid Sometimes

When an AI coding agent forgets a constraint, drifts halfway through a task, or confidently invents things that never existed, the first reaction is often that “this model is not good enough”.

After using Cursor, Claude Code, and other AI coding agents daily for a long time, I changed my mind.

When a capable model feels dumb, the problem is often not intelligence. It is context: what you gave it, what you did not, and what quietly fell out of the window.

This post is about my experience with tokens and context windows, plus the techniques I use to keep AI coding agents stable when tasks get long.

Tokens and context windows

A token is just a chunk of text that the model (such as Claude Opus or GPT Codex) processes.

  • Sometimes it is a whole word.
  • Sometimes it is part of a word.
  • Sometimes it is punctuation.

There are good, detailed explanations on the internet, so I will not dive deep into how the tokenizer works. You do not need to know the exact tokenization rules. What matters is that everything becomes tokens in the end: code, comments, logs, chat messages, etc.

You also do not need to count them, but you do need to understand relative cost, for example:

  • A short chat message is cheap.
  • A function or interface is moderate.
  • A full file is expensive.
  • Long stack traces or repeated logs are very expensive.

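If you want a feel for these costs, you can count tokens yourself. Here is a small sketch using the tiktoken library; cl100k_base is just one encoding, and different models use different tokenizers, so exact counts vary, but the relative costs are what matter.

```python
# Rough token counts with tiktoken; exact numbers differ per model,
# but the relative costs are what matter.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "short chat message": "Please keep the existing error types.",
    "small function": "def add(a: int, b: int) -> int:\n    return a + b",
    "one stack trace line": 'File "/app/api/routes.py", line 42, in create_user',
}

for name, text in samples.items():
    print(f"{name}: {len(enc.encode(text))} tokens")
```

Multiply that last one by a few hundred lines of logs and you can see why pasting raw output is expensive.
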
The context window is the maximum number of tokens the model can hold in its head at the same time.

If you imagine the model as reading from left to right, the context window is the size of the table it can spread papers on. Once the table is full, new papers push old ones off the edge.

When something falls off the table, the model does not partially remember it. It is simply gone. From that point on, the model can only guess based on patterns it has seen before.

This is why agents can feel sharp at the beginning of a task and unreliable later. It is not that they get tired. It is that the table gets crowded.
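
Here is a minimal sketch of that mechanic, assuming a count_tokens helper (tiktoken or similar). Real agents are more sophisticated, for example by pinning the system prompt or summarizing older turns, but the effect is the same: once the budget is exceeded, older content is dropped entirely, not partially.

```python
# Keep only the most recent messages that fit in the token budget.
# count_tokens is assumed to be any tokenizer-based counter.
def fit_to_window(messages: list[str], budget: int, count_tokens) -> list[str]:
    kept, used = [], 0
    for message in reversed(messages):   # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > budget:
            break                        # everything older falls off the table
        kept.append(message)
        used += cost
    return list(reversed(kept))          # restore chronological order
```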

The failures you keep seeing

These failures look different, but they usually share the same root cause: context pressure.

Forget

The agent stops following a rule you stated earlier.

Examples:

  • You said “do not change the API shape” and it changes it anyway.
  • You said “use existing error types” and it invents new ones.

What is actually happening under the hood:

Early in the session, your constraint competes with very little, so it carries strong weight in attention. As the session grows, more tokens arrive: diffs, logs, partial refactors, explanations. The original constraint is still technically present, but it is now one small signal inside a much larger pool of text.

Even before the hard window limit is reached, attention dilution happens. The model “sees” the rule, but it no longer treats it as central.

Drift

The agent starts solving a different problem than the one you asked it to solve.

Examples:

  • It starts refactoring style instead of finishing the feature.
  • It optimizes performance before correctness is done.

What is actually happening under the hood:

Language models are next-token predictors. They optimize for locally coherent continuations. If the most recent context heavily features refactoring, performance discussion, or style comments, the model will naturally continue along that trajectory.

Drift is often a local optimum problem. The short-term pattern in context overpowers the long-term goal.

Hallucinate

The agent claims things exist that do not exist.

Examples:

  • Referencing functions whose names are close to real ones but wrong.
  • Claiming files were modified when they were not.

What is actually happening under the hood:

When grounding context is incomplete, the model does not return “I do not know” by default. It fills the gap with the most statistically plausible continuation. If you showed it a similar function earlier, it may reconstruct a nearby variant. If you mentioned a file pattern, it may assume another file exists.

This is not malicious behavior. It is pattern completion under uncertainty.

Why this happens more in coding

Coding work has high dependency density.

In normal chat, if one sentence is slightly wrong, the conversation still flows. In code, one missing import breaks compilation. One wrong assumption about a type breaks behavior. One renamed field silently corrupts data.

Coding agents juggle multiple representations simultaneously: source code, diffs, chat, logs, etc. Each of these is token-heavy. More importantly, they have different semantic roles.

When all of them are mixed in a long session, the signal-to-noise ratio drops. The model is not just reading text. It is trying to reconcile structure, behavior, and history at the same time. Even before hitting the hard context limit, attention gets fragmented. The agent may technically “see” the goal, but its attention is pulled toward the most recent diff or the loudest stack trace.

This is why long coding sessions degrade faster than normal chat. It is a hard problem, and both the Codex and Cursor teams have shared that improving long coding sessions is a focus for them.

Techniques that actually keep agents stable

I think about context in three layers.

  • Layer 1: the contract
    • This must never be lost.
    • This includes goals and non-goals, constraints, acceptance criteria, and invariants that must hold.
  • Layer 2: the working set
    • The minimum needed for the current step.
    • This includes the files you are editing and the interfaces they depend on.
  • Layer 3: the noise
    • This is everything else.
    • This includes old attempts, long logs, unrelated files added just in case.

I optimize the context to reduce the noise as much as I can, while keeping the contract as simple as I can.
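
As a sketch of how I think about assembling a prompt from those layers (the helper names are mine, not any particular tool's API): the contract always goes in, the working set fills whatever budget remains, and the noise is simply never passed in.

```python
# Layer 1 is always included, layer 2 fills the remaining budget,
# layer 3 (noise) never makes it into the prompt at all.
def build_context(contract: str, working_set: list[str],
                  budget: int, count_tokens) -> str:
    parts = [contract]
    used = count_tokens(contract)
    for item in working_set:
        cost = count_tokens(item)
        if used + cost > budget:
            break                        # trim the working set, never the contract
        parts.append(item)
        used += cost
    return "\n\n".join(parts)
```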

Start with a written contract

I write one even if it feels redundant. A contract is not for the model. It is for stabilizing attention.

My hypothesis is that attention is relative. The model weighs tokens against each other. A concise contract placed at the top of the prompt anchors the distribution of attention. It keeps global intent visible even when local diffs and logs get noisy.

When drift happens, I do not argue inside a polluted context. I restate the contract cleanly and continue, or, if needed, restart with the contract at the top.
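
To make that concrete, here is a hypothetical contract for a made-up task; the wording is illustrative, not a template.

```python
# A hypothetical contract, kept short and prepended to every prompt.
CONTRACT = """\
Goal: add rate limiting to the public API endpoints.
Non-goals: no changes to authentication, no changes to the API shape.
Constraints: reuse the existing error types; keep the current middleware pattern.
Acceptance: existing tests still pass; the limiter has its own unit tests.
"""

current_step = "Implement the limiter middleware only; do not touch the handlers yet."
prompt = CONTRACT + "\n" + current_step   # the contract always comes first
```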

Chunk by intent, not by files

I do not run one giant session until done.

I split by intent. This is exactly the concept I applied in AI DevKit with phase splitting:

  • understand requirement
  • design approach and plan
  • implement minimal slice
  • add tests
  • polish

Each chunk gets its own small working set and its own definition of done.

Large sessions fail because they mix planning, execution, debugging, and optimization in one attention pool. By chunking by intent, you reduce cognitive load for the model. Each phase has a narrower objective and fewer competing signals.
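
Here is a sketch of what that can look like in practice; the file names and done criteria are made up. The point is that every phase carries only its own working set and its own definition of done.

```python
# Illustrative phases; file names and "done" criteria are hypothetical.
PHASES = [
    {"intent": "understand requirement", "working_set": ["ticket.md"],
     "done": "open questions are answered"},
    {"intent": "design approach and plan", "working_set": ["api/routes.py"],
     "done": "plan is approved"},
    {"intent": "implement minimal slice", "working_set": ["api/routes.py", "api/errors.py"],
     "done": "happy path works end to end"},
    {"intent": "add tests", "working_set": ["tests/test_routes.py"],
     "done": "tests pass"},
    {"intent": "polish", "working_set": ["api/routes.py"],
     "done": "review comments addressed"},
]
```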

Convert history into state

When a session gets long, I pause and ask the agent to produce a state snapshot:

  • what changed
  • what is broken
  • what is next
  • what constraints still apply
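
A snapshot might look like this; everything here is hypothetical, and the field names just mirror the questions above.

```python
# A hypothetical state snapshot produced at the end of a long session.
snapshot = {
    "what_changed": ["api/routes.py: added limiter middleware"],
    "what_is_broken": ["test_burst fails on concurrent requests"],
    "what_is_next": "fix burst handling, then wire up the limit config",
    "constraints": ["do not change the API shape", "reuse existing error types"],
}
```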

Then I start a new thread with:

  • the minimal files for the next step
  • the contract
  • the snapshot

Context compaction is on the AI DevKit roadmap, and I hope to release it soon.

Treat rules as context compression

Rules exist so you do not repeat yourself.

They should encode stable invariants:

  • naming conventions
  • architectural boundaries
  • error handling patterns
  • preferred libraries
  • etc

Good rules are small and stable, and they encode conventions and invariants.

Bad rules are long, full of task-specific details, and change every session.

You can also leverage files like AGENTS.md or CLAUDE.md as stable instruction anchors.

Instead of repeating architectural principles, coding conventions, and workflow rules in every session, move them into a dedicated, version-controlled instruction file. These files act as persistent context contracts that the agent can load explicitly when needed.
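
For example, an AGENTS.md might contain nothing more than a handful of stable rules like these (the rules themselves are made up for illustration):

```md
# AGENTS.md
- Use the existing error types in api/errors.py; do not add new ones.
- New endpoints follow the handler -> service -> repository layering.
- Prefer the standard library HTTP client; avoid adding dependencies.
- Run the test suite before declaring a task done.
```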

Design tool outputs for context efficiency

Tool outputs still consume tokens, so I prefer:

  • structured data over text
  • summaries over raw logs
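
For example, a long Python traceback can usually be compressed to its final error line plus the innermost frames before it goes back into the conversation. A rough sketch:

```python
# Compress a raw traceback to its error line plus the innermost frames.
def summarize_traceback(raw: str, max_frames: int = 5) -> str:
    lines = raw.strip().splitlines()
    error = lines[-1] if lines else ""                    # the actual exception
    frames = [l for l in lines if l.lstrip().startswith("File ")]
    return "\n".join([error, *frames[-max_frames:]])      # keep the innermost frames
```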

This is also why I personally prefer CLI-style integrations over heavy MCP-style preload patterns for many workflows.

When you preload large tool responses into the conversation, you are paying the context cost upfront. The agent must carry that entire payload in its active window, even if it only needs a small portion of it.

A CLI model is different. The agent calls a command when it needs something. The command returns structured output to stdout. The agent consumes exactly what is required for the current step.
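
Here is a sketch of that pattern, using a made-up `notes search` command in place of any real tool; the key is that only the fields needed for the current step enter the conversation.

```python
import json
import subprocess

# Hypothetical CLI call; "notes" stands in for whatever tool the agent needs.
result = subprocess.run(
    ["notes", "search", "rate limiting", "--json"],
    capture_output=True, text=True, check=True,
)
hits = json.loads(result.stdout)

# Keep only what the current step needs, not the full payload.
relevant = [{"title": h["title"], "summary": h["summary"]} for h in hits[:3]]
```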

This is the spirit behind how I designed AI DevKit memory. Memory is not preloaded into every session. It is a standalone CLI. The agent calls it when needed, gets structured, minimal output, and continues reasoning.

Use multi-agent workflows deliberately

Multi-agent setups multiply context consumption and coordination cost. You are not just paying in tokens. You are paying for reconciliation complexity.

Use parallel agents when uncertainty is high and you want to explore alternative approaches. For deterministic execution, a single agent with a stable context stack is usually more reliable.

For example, once the plan is stable and the acceptance criteria are clear, I prefer one agent to execute end-to-end.

I do not split into separate agents for backend, frontend, and so on at that stage. That kind of split increases coordination overhead and context duplication. Each agent needs a slightly different slice of the same contract, the same files, the same invariants. You end up paying the context cost multiple times and then reconciling their outputs.

Instead, I keep a single focused agent with a clean contract and a tight working set. It holds the intent consistently from implementation to tests to small refactors.

I only split agents when the problem space itself is separable. For example, different features that do not share tight coupling, or alternative solution explorations where diversity of approach is valuable.

Closing thought

AI tools made generation cheap.

They made information selection expensive.

When the agent feels dumb, do this

  1. Pause and assume context loss first.
  2. Restate the contract.
  3. Reduce the working set.
  4. Ask for a state snapshot.
  5. Continue in a fresh thread.

If you learn to manage context deliberately, your agent will feel stable and consistent. Not because the model changed, but because your workflow did.

If you want to read more about software engineering in the AI era, subscribe to my blog. I will keep sharing what I learn while building systems with AI in the loop.

