The first time an agent rewrites a file in your repo correctly, the impressive part is not the rewrite. It is that the agent knew where to look. That instinct — what to read, what to ignore, what to confirm — is most of the job.
The shape of the problem
A useful coding agent has to make three guesses at once: what the user actually wants, what the codebase already does, and what side effects a change will trigger. Model size helps mostly with the first. The second depends on retrieval and the third on feedback, which gives three requirements:
- Scope: the agent must restrict its attention to files that matter, or it drowns.
- Grounding: it must read, not guess, before suggesting changes.
- Loop: it must observe the result of its own edits — tests, types, runtime — and correct.
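The three requirements above can be sketched as a small harness loop. This is a minimal illustration under stated assumptions, not any particular product's design: `run_agent_loop`, `select`, `propose`, `check`, and `CheckResult` are hypothetical names, and `propose` is a stub standing in for a model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    ok: bool
    feedback: str  # e.g. failing test output or a type error


def run_agent_loop(
    task: str,
    files: dict[str, str],
    select: Callable[[str, dict[str, str]], list[str]],
    propose: Callable[[str, dict[str, str], str], dict[str, str]],
    check: Callable[[dict[str, str]], CheckResult],
    max_rounds: int = 3,
) -> tuple[dict[str, str], bool]:
    """Hypothetical harness: scope, ground, then edit and check in a loop."""
    # Scope: narrow the repo to the files that plausibly matter.
    relevant = select(task, files)
    feedback = ""
    for _ in range(max_rounds):
        # Grounding: hand over the current file contents, not just names.
        context = {path: files[path] for path in relevant}
        edits = propose(task, context, feedback)
        files = {**files, **edits}
        # Loop: observe the result of the edit and feed it back.
        result = check(files)
        if result.ok:
            return files, True
        feedback = result.feedback
    return files, False


# Toy stand-ins so the sketch runs end to end (all hypothetical).
def select(task: str, files: dict[str, str]) -> list[str]:
    return [path for path in files if path.endswith(".py")]


def propose(task: str, context: dict[str, str], feedback: str) -> dict[str, str]:
    # Stand-in for a model call: wrong at first, corrected once it sees feedback.
    body = "return a + b" if feedback else "return a - b"
    return {"calc.py": f"def add(a, b):\n    {body}\n"}


def check(files: dict[str, str]) -> CheckResult:
    # Stand-in for a test run: execute the file and call the function.
    namespace: dict = {}
    exec(files["calc.py"], namespace)
    ok = namespace["add"](2, 3) == 5
    return CheckResult(ok, "" if ok else "add(2, 3) returned the wrong value")


repo = {"calc.py": "def add(a, b):\n    pass\n", "README.md": "notes"}
final_files, succeeded = run_agent_loop("fix add", repo, select, propose, check)
```

In this toy run the first proposal fails the check, the failure message is passed back, and the second proposal passes. The model stub never changes between rounds; only what the harness feeds it does.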
Why bigger models alone do not fix this
A bigger model can hold more context, but holding context is not the same as choosing what to put in it. The harness — the loop around the model — is where most of the engineering lives. That is why two products built on the same model can feel a generation apart.
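To make the gap between holding context and choosing it concrete, here is a minimal sketch of one selection step a harness might perform. It assumes a crude word-overlap score and a character budget as stand-ins for real retrieval and a real token limit; `pick_context` and `budget_chars` are hypothetical names.

```python
import re


def _words(text: str) -> set[str]:
    # Lowercased identifier-like tokens; a crude stand-in for real retrieval.
    return set(re.findall(r"[a-z_]+", text.lower()))


def pick_context(task: str, files: dict[str, str], budget_chars: int) -> list[str]:
    """Hypothetical selector: rank files by overlap with the task, fill a budget."""
    terms = _words(task)
    scores = {path: len(terms & _words(body)) for path, body in files.items()}
    # Highest overlap first; ties broken by path so the order is deterministic.
    ranked = sorted(files, key=lambda path: (-scores[path], path))
    chosen: list[str] = []
    used = 0
    for path in ranked:
        if scores[path] == 0:
            break  # every remaining file shares no words with the task
        size = len(files[path])
        if used + size > budget_chars:
            continue  # too large for what is left; a smaller file may still fit
        chosen.append(path)
        used += size
    return chosen


files = {
    "config.py": "def load_config(path): return parse(path)",
    "parser.py": "def parse(path): return open(path).read()",
    "README.md": "Project overview and install notes",
}
task = "fix load_config path handling"
tight = pick_context(task, files, budget_chars=50)
roomy = pick_context(task, files, budget_chars=200)
```

A larger budget admits more files, but the ranking decides which ones get in first. That ranking belongs to the harness, not to the model.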
Treat the model as a fast intern with no memory. The system around it is the senior engineer.
What to watch
When evaluating an agentic dev tool, do not stop at the demo. Ask: how does it pick what to read? What does it do when it is wrong? Can you see, and override, its plan? The answers tell you whether the tool will survive contact with a real repo.