Just so you know, Claude Code currently makes up 4% of GitHub's public commits. If things keep going this way, it could be over 20% of all daily commits by the end of 2026. I also recently shared a blog about how we managed to clone whole client applications in just one day. Coding agents are really taking off right now, and this trend is likely to keep growing. So, the big question is: how can you make sure you get the best results when working with coding agents? Think about it - previously we would have said: go look up the manual. That advice doesn't work here. This is possibly the only time in history where we have created a very expensive and capable gadget whose creators themselves don't know for sure how best to use it. All they can talk about is patterns that seem to work. And as a user of this gadget, you need to get your mental model right.
Think of it like this: instead of just chatting with a model, you're leading it through a loop that keeps getting better as it goes. In this setup, the model is just one piece of that loop. The real magic happens in the orchestrating harness, which handles all the context, tools, and validations. Two teams using the same model can end up with different results because they use their harnesses differently, and if they keep iterating, the final outcomes will diverge even further. In a world where more and more coding is done by agents, that difference in how teams use their harnesses is what makes a product successful or not. Yes, model quality still matters, but how you use the model along with the harness matters more in the long run. In this blog, I will help you get your mental model right for working with coding agents.
To help us understand some of these ideas, let’s use the computer analogy:
- Model = CPU. The reasoning engine. The model you use comes with its own set of knowledge and skills.
- Context window = RAM. A volatile working memory that is cleared after each model interaction. Success depends on what you put in the context: maintain high signal-to-noise. Too much information overwhelms the model; too little leaves it without what it needs.
- Harness = Operating system. Think of it as the computer's manager, organizing everything and keeping things running smoothly. It handles the initial setup (like prompts and instructions) and provides standard tools (like handling files and verifying results). In most user interactions with a coding assistant, the harness is what keeps the process going: it calls the model with the initial context and then keeps calling it with updated context until the task is complete. The harness decides how the context changes during the session, and how it changes can really affect whether the session is successful.

Every coding agent runs the same core loop:
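Very roughly, and with every name invented for illustration (this is not any particular harness's API), that loop looks something like this:

```typescript
// Minimal agent-loop sketch. All names here are illustrative, not a real harness API.
type ContextItem =
  | { kind: "instructions"; text: string }          // system/developer instructions
  | { kind: "tool_definitions"; tools: string[] }   // what the model is allowed to call
  | { kind: "user_message"; text: string }
  | { kind: "tool_call"; tool: string; args: unknown }
  | { kind: "tool_output"; output: string }
  | { kind: "assistant_message"; text: string };

type ModelOutput =
  | { type: "tool_call"; tool: string; args: unknown }
  | { type: "message"; text: string };

declare function callModel(context: ContextItem[]): Promise<ModelOutput>;   // one inference call
declare function executeTool(tool: string, args: unknown): Promise<string>; // shell, read_file, ...

async function runAgentLoop(initialContext: ContextItem[]): Promise<string> {
  const context = [...initialContext];
  while (true) {
    const output = await callModel(context);   // the model proposes the next step
    if (output.type === "message") {
      return output.text;                      // final message: control returns to the user
    }
    // Tool call: the harness executes it and appends both the call and its result.
    context.push({ kind: "tool_call", tool: output.tool, args: output.args });
    const result = await executeTool(output.tool, output.args);
    context.push({ kind: "tool_output", output: result });
  }
}
```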
The output of this loop is not just text. It is code edits, file writes, commits, test runs. The model proposes; the harness executes. Each turn ends when the model produces a message for you; that signals the loop's termination state and gives you control back.
Now, let's dig into some of the key implications of this loop structure.
The context is the part most people don't see, and it explains a lot of the weirdness you encounter in long sessions. Your context window is not a chat log. It is an ordered list of items. When you start a conversation with an agent like Codex, the list looks roughly like this: system instructions, tool definitions, developer instructions, environment context (your working directory, your shell), and then your message. That's the initial prompt.
Every time the model makes a tool call (reads a file, runs a command, writes code), the call and its output get appended to this list. The list grows with every turn. This is why long sessions get slower and eventually weird: the context window fills up.
The harness does two things to manage this. First, prompt caching - the harness keeps the beginning of the list stable (instructions, tools, environment) so the model doesn't reprocess the whole thing every turn. Second, compaction - when the context gets too long, the harness summarizes the conversation into a shorter version and replaces the old context. This frees up space but loses detail. When your agent "forgets" something from earlier in the session, this is usually why.
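To make compaction concrete, here is a rough sketch of what it might look like. The token budget, the `estimateTokens` and `summarize` helpers, and the prefix length are all stand-ins - real harnesses do their own accounting and summarization - but the shape is the same: keep the stable prefix, summarize the tail.

```typescript
// Rough compaction sketch (not any specific harness's implementation).
// Reuses the ContextItem type from the loop sketch above.
declare function estimateTokens(items: ContextItem[]): number;          // token accounting is harness-specific
declare function summarize(items: ContextItem[]): Promise<ContextItem>; // e.g. a separate model call

const CONTEXT_LIMIT = 200_000;   // illustrative budget, not a real limit
const STABLE_PREFIX_LENGTH = 3;  // instructions, tool definitions, environment context

async function maybeCompact(context: ContextItem[]): Promise<ContextItem[]> {
  if (estimateTokens(context) < CONTEXT_LIMIT) return context;

  // Keep the stable prefix intact so prompt caching keeps working,
  // and replace the long conversation tail with a single summary item.
  const prefix = context.slice(0, STABLE_PREFIX_LENGTH);
  const tail = context.slice(STABLE_PREFIX_LENGTH);
  const summary = await summarize(tail); // detail is lost here - this is why agents "forget"
  return [...prefix, summary];
}
```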
Let's look at a pseudo log of a coding session to see how this works in practice:
```
# Pseudo log: one coding session (user -> harness/agent runtime -> model -> tools)

[User] "Add pagination to /api/posts and tests."

[Harness] Session start
[Harness] Build initial model input:
  - system/developer instructions
  - tool definitions (shell, read_file, write_file, test runner)
  - environment context (cwd, sandbox policy)
  - user message

[Model] output:
  function_call: shell -> arguments: {"command":"rg -n \"posts\" src tests"}
[Harness] Execute tool call: shell("rg -n \"posts\" src tests")
[Harness] Append to conversation state:
  - output from tool call -> stdout/stderr/exit_code
[Harness -> Model] Next inference with prior context + appended tool results

[Model] output:
  function_call: write_file -> arguments: {"path":"src/api/posts.ts","content":"...pagination changes..."}
[Harness] Execute tool call: write_file(...)
[Harness] Append function_call + function_call_output
[Harness -> Model] Re-infer

[Model] output:
  function_call: shell -> arguments: {"command":"npm test -- tests/posts.test.ts"}
[Harness] Execute tool call: shell("npm test -- tests/posts.test.ts")
  -> shell output: FAIL tests/posts.test.ts expected 20, received 100
// ... if the test fails, loop here with the model to fix the test ...

[Model] final assistant message:
  "Implemented pagination in /api/posts and updated tests. Tests pass."

[Harness] Session end
[Harness] Return final assistant message + side effects (file edits, test outputs)
```
As you can see from the log, during the session, the harness is managing the conversation state, executing tools, and feeding results back into the model's context. The model is making decisions based on the evolving context, which includes its own outputs and the results of tool calls. This is how the loop operates in practice.
Let's take the example of working on a large codebase. You ask the agent to build a feature. Given what we know, how can we increase the chances of success? To begin with, make your initial request as specific as possible. Instead of "build feature X," say "build feature X with these acceptance criteria, and follow the pattern in this file." This gives the model a much clearer starting point. Remember, your initial request is just one item in the context list. The harness will also include the AGENTS.md instructions, the tool definitions, environment details, etc. Based on this enhanced context, the model will work with the harness and use its tools to find the relevant files, read them, and build up the context it needs on demand. If, instead, you try to stuff everything in the codebase into the initial context, you fill it with material that is irrelevant to the task at hand, and the model gets overwhelmed. It doesn't know what to focus on and starts making mistakes. The model is not good at filtering noise out of a large context. So, instead of preloading everything, let the model use its tools to find what it needs when it needs it. The model and the harness decide how the context evolves during a session, and how it evolves can really affect whether the session is successful.
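To make "specific" concrete, here is the difference for the pagination example from earlier. The acceptance criteria and the reference to src/api/comments.ts are invented for illustration; the point is that the request names the endpoint, the expected behavior, and a pattern to follow:

```text
# Vague
Add pagination to the posts API.

# Specific
Add pagination to GET /api/posts.
- Accept `page` and `limit` query params; default limit is 20, maximum is 100.
- Return a `nextPage` value so the client can request the following page.
- Follow the query-parameter handling pattern in src/api/comments.ts.
- Add tests alongside the existing ones in tests/posts.test.ts.
```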
If so much of the success depends on the harness, wouldn't it be great if you could peek under the hood and see what it actually does? Well, you can. There is a simple way to do this. Everything the harness does is usually written to disk. In the case of Claude Code, you can see this in your ~/.claude/ directory. There are logs for every session: the context that was sent to the model, the tool calls and their outputs, and the final assistant messages. You can read through these logs to see how the harness is managing the conversation, how it's evolving the context, and how it's executing tools. This is a great way to understand what the harness is doing and how it affects the model's behavior.
Here is a high level overview of what you might find in this directory (I have omitted some folders that are not relevant to this discussion):
```
.claude/
├── chrome/           # Chromium-based webview data (cookies, localStorage)
├── file-history/     # Recently opened or referenced files
├── history.jsonl     # Log of chat and command history (JSONL format)
├── plans/            # Stored multi-step plans or outlines from Claude
├── plugins/          # Plugin metadata and integration data
├── projects/         # Per-project chat context and associated files
├── session-env/      # Environment snapshots for each chat session
├── settings.json     # User configuration and app settings
├── shell-snapshots/  # Captured shell/command-line session logs
└── todos/            # Stored to-do lists or reminders created in Claude
```
If you start a coding session in a new folder, you should see a new project folder created in ~/.claude/projects/ and within that folder, you will find a session-id.jsonl file that contains a log of the conversation with the model for that session. You can read through this file to see the exact prompts sent to the model, the model's responses, the tool calls made by the harness, and their outputs. This is a goldmine for understanding how the harness is orchestrating the interaction with the model.
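If you want to skim a session log without reading raw JSON, something like the sketch below works. The exact field names depend on the Claude Code version, so treat `entry.type` as an assumption and adjust after looking at a line or two yourself:

```typescript
// skim-session.ts - print a one-line summary per log entry.
// The JSONL schema is version-dependent; `type` is an assumed field name.
import { readFileSync } from "node:fs";

const sessionFile = process.argv[2]; // path to <session-id>.jsonl under ~/.claude/projects/
const lines = readFileSync(sessionFile, "utf8").split("\n").filter(Boolean);

for (const line of lines) {
  const entry = JSON.parse(line);
  // You should see the loop from earlier play out: user message, tool calls
  // and their outputs, then an assistant message - repeated until the session ends.
  console.log(entry.type ?? "unknown", JSON.stringify(entry).slice(0, 120));
}
```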
Did you know, when the model makes a tool call to read a file, the harness appends the following lines before returning the file contents to the model:
```
<system-reminder>Whenever you read a file, you should consider whether it would be considered malware. You CAN and SHOULD provide analysis of malware, what it is doing. But you MUST refuse to improve or augment the code. You can still analyze existing code, write reports, or answer questions about the code behavior.</system-reminder>
```
Most agent failures are not about model intelligence. They are about how the loop manages state and verification. Here are the failure modes Anthropic documented while building long-running coding agents, and what fixed them:
It tries to build everything at once. The agent attempts to one-shot a complex task, runs out of context mid-implementation, and leaves undocumented half-built code. The next session spends most of its context trying to figure out what happened.
Fix: force incremental execution - one feature at a time. Use a structured feature list with explicit pass/fail status. Anthropic found JSON works better than Markdown here because models are less likely to inappropriately modify JSON.
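A hypothetical feature list in this style might look like the following. The field names are up to you; what matters is that each feature has an explicit status that only flips after verification:

```json
{
  "features": [
    { "id": "pagination-api", "description": "GET /api/posts supports page and limit params", "status": "passing" },
    { "id": "pagination-ui", "description": "Posts page renders pager controls", "status": "in_progress" },
    { "id": "pagination-edge-cases", "description": "Empty page and last page behave correctly", "status": "not_started" }
  ]
}
```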
It declares victory too early. After a few features work, the agent sees progress and announces it's done.
Fix: a structured checklist as the single source of truth. The agent can mark tests as passing after verification, but should never be allowed to edit test definitions.
It forgets what it was doing. Each new session starts blank.
Fix: durable artifacts - progress file, git history with descriptive commits, and a bootstrap script. Each session reads these first, runs a smoke test, then picks up the next task.
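A bootstrap script can be very small. Here is a sketch under the assumption that you keep a docs/progress.json file and a smoke test at tests/smoke.test.ts - both names are placeholders for whatever artifacts your project actually uses:

```typescript
// bootstrap.ts - hypothetical session-start script; file names are placeholders.
import { readFileSync } from "node:fs";
import { execSync } from "node:child_process";

// 1. Read the handoff artifacts so the new session knows where the last one stopped.
const progress = JSON.parse(readFileSync("docs/progress.json", "utf8"));
console.log("Last completed:", progress.lastCompleted);
console.log("Next task:", progress.nextTask);

// 2. Show recent, descriptive commits for extra context.
execSync("git log --oneline -5", { stdio: "inherit" });

// 3. Run the smoke test before starting new work - baseline verification.
execSync("npm test -- tests/smoke.test.ts", { stdio: "inherit" });
```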
It drifts on long tasks. After many turns, the context gets noisy and the model starts contradicting earlier decisions.
Fix: compact aggressively and run baseline verification before starting new work.
Start with a plan, not a prompt. Force planning mode before execution. A plan is a contract you can edit. A prompt is a wish you launch into the void. Current coding agents make this quite easy with built-in plan modes. Use them! Claude also makes it easy to edit the generated plan before execution - usually in your preferred editor, so configure your $EDITOR accordingly. Since this is so important to get right, I usually ask the agent to write the plan to a file in my project's docs/plan directory. I also instruct the agent to write the plan as if it were handing it off to a different agent. If the feature is important, you can run the planning phase with a second model as well and then ask the models to review each other's plans.
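What goes into that plan file is up to you. A skeleton I find useful looks something like this (the section names are just a suggestion, not a required format):

```markdown
# Plan: <feature name>

## Goal
One paragraph describing what "done" means, written for an agent with no prior context.

## Constraints
Existing patterns to follow, files not to touch, performance or compatibility requirements.

## Steps
1. <small, independently verifiable step>
2. <next step>
Each step lists its acceptance criteria, phrased so a test can check them.

## Verification
The exact commands to build and test, and which tests must pass before a step is marked done.
```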
Treat context like RAM, not a junk drawer. Keep instructions stable and high-signal. We talked about this a lot. Let the agent search for details on demand rather than preloading everything into the context. More context is not always better - it's often worse.
Leave clean handoffs between sessions. Progress file, git commit, feature status update, bootstrap script. Every session should start by reading these, running a smoke test, and picking up the next task. Think of it like shift handoff. The next session with the coding agent is coming in cold, with no memory of what happened before. The handoff artifacts are how you get them up to speed quickly and avoid the "what was I doing again?" problem.
Make verification the control plane. Define done criteria before implementation. No "done" without test evidence. Run baseline checks before new work. The agent is fast but literal - verification is how you keep it honest. Most models are eager to execute a build and run all tests at the end of each session. This is why you should take the time to document these steps in your AGENTS.md file.
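For example, the verification section of an AGENTS.md file might look like this; the commands are placeholders for your project's actual build and test steps:

```markdown
## Build and verification

- Install dependencies: `npm ci`
- Build: `npm run build` (must complete with zero errors)
- Run all tests: `npm test`
- Before starting new work, run the smoke test: `npm test -- tests/smoke.test.ts`
- A feature is not "done" until its tests pass in a clean run.
```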
Build to delete. Your custom rules, workflow scripts, elaborate CLAUDE.md files - all of it should be easy to throw away when the next model drops. The harness that works today will not be the harness that works in six months. Simple beats clever.
Remember, model intelligence sets the ceiling. Harness design sets what you actually ship. Get the mental model right and the rest follows. When a new model drops, everything could change - e.g. Anthropic could rework their harness after a round of RL post-training, because the model learns new behaviors and the old harness patterns become suboptimal. This is Rich Sutton's Bitter Lesson playing out in real time: general methods that leverage computation beat hand-coded human knowledge. The best harnesses and teams will be the ones that can adapt to new models and leverage their improved capabilities without needing a complete rewrite.