Open Models That Code

Chapter One

The two halves

Reading The documentation for Aider, OpenCode, Cline, and Goose on connecting a local or OpenAI-compatible model; and Unit one, Run Your Own Model.

1.1Two pieces, not one

An agentic coding setup is two separable parts, and almost every confusion about "local coding agents" comes from collapsing them into one. The first part is the model: the open weights you met in Unit one, here tuned for code. A model, by itself, is a function from text to text — it answers, and stops. The second part is the harness, or agent: the program that wraps the model in a loop and turns answers into actions. It reads your repository, decides which files matter, sends them to the model with your instruction, takes the diff the model proposes, applies it, runs the tests, and — if something failed — feeds the failure back and goes around again. The model supplies the judgment about what to write; the harness supplies the hands, the eyes, and the persistence.

If that loop sounds familiar, it is the outer loop from the Loop Engineering guide, made concrete for code: a trigger, a prompt, a verification step, and a decision to continue or stop. The model is one component inside it. Which means the quality of your coding agent is not set by the model alone — it is set at least as much by the harness around it: how it slices context, whether it forces a diff, whether something in the loop can say no.

1.2The endpoint that joins them

What lets you treat the two halves as independent is a piece of plumbing: the OpenAI-compatible endpoint. When a chat API became widely adopted, its request-and-response format became a de facto standard, and the open ecosystem standardized on it. Ollama exposes one (at http://localhost:11434/v1); so do LM Studio, vLLM, and SGLang. And the open coding agents are built to point at any endpoint speaking that format. The practical result is a clean seam: the harness sends a standard request, the model server answers in the standard shape, and neither needs to know anything specific about the other.

Figure 1. The two halves, joined by a standard socket. The harness drives the loop; the model supplies the judgment; the OpenAI-compatible endpoint is the seam between them. Because the seam is standard, you can swap either side without touching the other — and because both sit inside your machine, the codebase never crosses out.

1.3Why this is the sovereign move

Put the two halves on your own hardware and connect them over that local socket, and you have something the cloud arrangement can't offer: a coding assistant whose every input — your prompt, your files, your secrets, the model's edits — stays inside your machine. For a hobby project that is a nicety. For a team under a data-residency rule, a regulated industry, or anyone holding source that is not theirs to send away, it is the requirement. And because both halves are open, there is a second kind of sovereignty underneath the first: no vendor can switch your stack off, change its price, or retire it from under you. An open agent under a permissive license can be forked; a closed one is a dependency you don't control. We will see in Chapter 3 that this is not hypothetical.

Check your understanding

Take one coding task you'd want help with and name the two halves you'd use for it: which model, and which harness. If you can't yet name them, that's fine — the next two chapters are exactly those two lists. The point to hold now is that they're separate choices, joined by a standard socket, and you get to make each one independently.

Chapter Two

The open coding models

Reading Z.ai's GLM-5.2 docs; the Hugging Face model cards for MiniMax M3, DeepSeek V4, Kimi K2, and Qwen3-Coder; the independent llm-stats tracker and the NIST CAISI evaluation of DeepSeek V4.

2.1The roster

A handful of open-weight models are genuinely strong at agentic coding as of mid-2026, almost all of them from Chinese labs, almost all mixture-of-experts designs. The short list worth knowing:

Model	Params (total / active)	License	A coding number*
GLM-5.2 (Z.ai)	744B / 40B	MIT	Terminal-Bench 2.1 81.0; SWE-bench Pro 62.1
DeepSeek V4-Pro	1.6T / 49B	MIT	SWE-bench Verified 80.6 (independently tracked)
Kimi K2.x (Moonshot)	1T / 32B	Modified MIT	SWE-bench Pro 58.6 (vendor)
MiniMax M3	428B / 23B	custom (see §2.2)	SWE-bench Pro 59.0 (vendor)
Qwen3-Coder (Alibaba)	up to 480B / 35B	Apache 2.0	the dependable self-host default
Devstral Small (Mistral)	~24B dense	Apache 2.0	runs on one 24 GB GPU

*Scores as of June 2026, and most are vendor-reported — see §2.4. The field moves monthly; verify against the primary before quoting.

GLM-5.2 is the one most practitioners reach for first: on Z.ai's own published numbers it is the strongest open-source coding model, scoring 81.0 on Terminal-Bench 2.1 — within a few points of Claude Opus 4.8 — and 62.1 on SWE-bench Pro, ahead of GPT-5.5. DeepSeek V4-Pro is its closest rival and has the distinction of an independently tracked 80.6 on SWE-bench Verified, the highest open-weight entry. The rest trade blows depending on the benchmark — which is the whole problem with benchmarks, and the subject of §2.4.

2.2Read the license, not the leaderboard

Unit one's hardest-won lesson applies double here: the word "open" on a download page tells you almost nothing, and the license file tells you everything. Most of this roster is genuinely permissive — GLM-5.2 and DeepSeek under MIT, Qwen3-Coder and Devstral under Apache 2.0, Kimi under a Modified MIT that allows commercial use with attribution. But MiniMax M3 is the cautionary specimen. Its weights are downloadable, but its Hugging Face card carries a custom "MiniMax" license, not MIT — and following the precedent of its immediate predecessor, commercial use may require written authorization. By the spectrum from Unit one, that places M3 in the Weights-Available tier, not open source, despite the open-weight billing. The model may still be the right tool; the point is that you only learn the constraint by reading the license, never the announcement.

Figure 2. Why these enormous models are runnable at all. A narrow band marks the active-parameter range the mixture-of-experts models cluster in — roughly 20 to 50 billion — while their total sizes spread more than sixty-fold. Active parameters set the speed and stay tight; total parameters set the memory you must fit and vary enormously. Devstral, a dense 24B, sits on the line where total and active are the same.

2.3Why these are runnable at all

A 1.6-trillion-parameter model sounds impossible to self-host, and a dense one would be. The reason this roster is runnable is the mixture-of-experts design from Unit one: each model quotes a huge total and a small active count, and a router uses only the active fraction per token. The figure above is the whole point — totals range from Devstral's 24 billion to DeepSeek's 1.6 trillion, a sixty-fold spread, while the active counts cluster tightly between about 20 and 50 billion. Active governs the per-token compute (the speed); total governs how much memory you must hold (the fit). So a frontier open coder needs a lot of memory but is quicker per token than its size suggests — and Devstral Small, a dense 24B, is the floor that runs on a single consumer card. That gap between total and active is exactly what makes the hardware question of Chapter 4 tractable.

2.4Reading the benchmarks

Every coding number in this chapter is a claim until you know who measured it and how. Most headline scores — SWE-bench Pro, Terminal-Bench — are vendor-run, produced by the lab releasing the model, on a harness of its choosing. That doesn't make them false, but it makes them marketing until someone independent reproduces them. So prefer the independent sources: trackers like llm-stats that run the harness themselves, the official swebench.com leaderboard, and — best of all — a neutral evaluator with nothing to sell. On that last count, the most useful single primary in this whole space is NIST's CAISI evaluation, which put DeepSeek V4's capabilities about eight months behind the contemporaneous frontier on its own benchmark suite. Treat a vendor's bars the way Unit one treated a launch chart: a starting hypothesis, not a finding.

A wrinkle worth naming plainly: in February 2026 Anthropic publicly accused Moonshot — and named DeepSeek and MiniMax — of training their models in part on Claude's outputs through fraudulent accounts. These are accusations, not established findings, and they remain disputed. But they sit in the background of this roster, and they raise a real question for anyone building on these weights: sovereignty over a model you run locally does not settle where its capability originally came from.

Practice

Pick one model from the roster and open its actual Hugging Face card. Find three things: its total and active parameter counts, its license name (not the word "open" — the license), and whether the coding score it advertises was measured by the lab or by someone independent. If you can't find an independent number, that absence is itself the finding.

Chapter Three

The harnesses

Reading The repositories and docs for Aider, OpenCode, Cline, and Goose; license details cross-checked against each project's repo.

3.1One endpoint, many faces

The open coding agents look like a crowded, confusing field, but they sort cleanly once you see the shared seam. Every one worth running drives the model over the same OpenAI-compatible endpoint from Chapter 1, so the real differences are about where they live and how they work, not which models they support — they all support yours. The split that matters is terminal versus editor: some agents are command-line tools that work directly on your files and git history, and some are extensions that live inside your editor with access to its context.

Figure 3. One socket, many faces. Terminal agents and editor extensions differ in where they live and how they work, but they all drive the model through the same OpenAI-compatible endpoint — so your choice of agent and your choice of model are independent. Everything below the boundary line runs on your hardware.

3.2The four to know

Four open agents cover most needs, all under permissive licenses. Aider (Apache 2.0) is the git-native terminal tool: it commits every AI change as a clean, reviewable diff, which is exactly what you want on a serious codebase where the audit trail matters. OpenCode (MIT) is the breakout — the most-starred open coding agent on GitHub by a wide margin — a terminal interface that speaks to dozens of providers and any local endpoint. Cline (Apache 2.0) is the editor option, a VS Code and JetBrains extension with explicit plan-then-act modes and per-change approval, for people who want the agent inside their editor with the open files in view. Goose (Apache 2.0), from Block, is a CLI-and-desktop agent that now lives under the Linux Foundation's Agentic AI Foundation — a governance signal that the open-agent layer is consolidating around neutral ground. Pick by workflow: git-centric terminal work points to Aider, a provider-broad terminal to OpenCode, editor-embedded work to Cline.

3.3The sovereignty dividend

The reason the license on your agent matters as much as the license on your model became concrete in 2026, when Continue — a widely used open coding extension — was acquired by Cursor, its repository frozen to read-only with a final release and no further roadmap. For users of a closed tool, that is the end of the line. But Continue was Apache 2.0, which means the code can be read, forked, and carried forward by the community regardless of what the company does. That is the whole argument for an open stack in one event: a closed agent is a dependency someone else can retire, reprice, or lock; an open one under a permissive license cannot be switched off from outside. Own the tools, and no acquisition is your problem.

Practice

Install one of the four — Aider if you live in a terminal and git, Cline if you live in an editor — but don't point it at a model yet. Just read its documentation for the line that explains how to set a custom or local endpoint. Finding that one configuration setting is the whole trick of this unit: it's the seam where your Chapter 2 model plugs into your Chapter 3 agent.

Chapter Four

The sovereign stack

Reading The local-endpoint sections of the agent docs above; the vLLM and SGLang serving docs; and Unit one's hardware chapter, Run Your Own Model.

4.1The pattern that works

The fully-local coding stack is, mechanically, simple: serve an open coding model behind a local endpoint, then point an open agent at it. For the smaller models, Ollama is enough — pull the model and it serves the OpenAI-compatible endpoint automatically, exactly as in Unit one. For the large mixture-of-experts models, you step up to a purpose-built serving engine — vLLM or SGLang — which handle the batching, memory paging, and throughput that the big models need to be usable. Either way the shape is identical and the property is the one that matters: the agent talks to localhost, and your prompts, your context, and your code never cross the boundary of your machine.

Figure 4. The sovereign coding loop. The harness reads the repo, asks the local model for an edit, applies it, runs the tests, and goes around again — the whole cycle inside the boundary of your machine. Compare Unit one's data-flow figure: same boundary, now with code and tests in the loop instead of a single chat turn.

4.2The hardware reality

Here honesty matters, because the coding models are big. The frontier open coders — GLM-5.2, DeepSeek V4, Kimi, MiniMax M3 — are the heavy end of Unit one's hardware ladder: their trillion-scale totals must fit in memory even though only tens of billions activate, so usable speed at long context means a multi-GPU server or a strong workstation, plus quantization. This is not laptop work for frontier quality. But the ladder has a real floor, and it is the equity story again: Devstral Small at 24B and a quantized Qwen3-Coder around 32B run a genuinely useful coding agent on a single 24 GB consumer GPU. You do not get GLM-5.2's long-horizon strength there — but you get a private, capable, daily-driver coding assistant for the cost of one used card, with your source staying home.

4.3The friction, honestly

Three frictions are worth naming before you commit, because the closed agents hide all of them. Speed: a long agent loop on a big local model is slower than the cloud, and that compounds across the dozens of turns a real task takes. Token burn: coding agents consume context fast — reading files, running tests, re-reading — and a long-context workload competes with the model for the same memory. Setup: serving a large MoE with vLLM is real systems work, well beyond Ollama's one-line pull. None of these is a dealbreaker, but together they are the price of sovereignty, and a guide that pretended otherwise would be selling you something.

Practice

Take the model you'd most like to code with and place it on Unit one's hardware ladder: estimate its four-bit size, and decide honestly whether it fits the machine you have. If it doesn't, drop to the floor — a 24B agent-tuned model on one card — and decide whether that's enough for your actual daily work before you spend anything. Choosing the model to the machine, on purpose, is the whole exercise.

Chapter Five

The honest gap

Reading Are "Solved Issues" in SWE-bench Really Solved Correctly? (ICSE 2026) and the SWE-ABS re-evaluation; the NIST CAISI evaluation; and the companion guide Who the Agent Rewards.

5.1Where open is competitive, and where it lags

The honest picture is neither "open has won" nor "open is a toy." On scoped work — a well-specified task, a bounded change, with tests to check it and a human reviewing — a top open model in a good harness is competitive with, and sometimes preferable to, a closed agent, at a fraction of the cost. Where open still trails the closed frontier is the hard, unbounded end: sustained planning over hundreds of steps, recovering gracefully from its own errors instead of piling on more, holding a large codebase in mind without perfect retrieval, and handling ambiguous or novel requirements without drifting. These are exactly the capabilities that separate a demo from a dependable engineer, and they are where the closed labs' post-training still shows.

5.2What the benchmarks hide

Before trusting any of the scores in Chapter 2, you have to know what they don't measure — and this is the most important thing in the unit. The benchmarks everyone quotes, SWE-bench and its Verified subset, judge a fix by whether it passes the repository's tests. But tests are not exhaustive, so a patch can pass them and still be wrong. A peer-reviewed study at ICSE 2026 examined this directly and found the test oracles weak enough to inflate reported resolution rates by several points; a follow-up re-evaluation that hardened the tests found that roughly one in five of the "solved" patches from the top agents were semantically incorrect — they passed only because the tests failed to catch their errors. When those bad patches were rejected, the top agent's score fell from 78.8% to 62.2%.

Figure 5. What a headline score hides. The reported number counts every patch that passed the tests; the re-evaluated number, after a hardened test suite rejected the patches that passed but were semantically wrong, is sixteen points lower. The gap is not unique to open models — it applies to the closed leaders too. It is a reason to trust independent evaluation over any leaderboard, and your own tests over both.

5.3The reconciliation

This is the same lesson the companion guide Who the Agent Rewards reaches from the other direction. A coding model is the input; the judgment about whether its output is actually right is supplied by the human and the harness, not the model. The jagged frontier means some coding tasks fall easily within an open model's reach and others, of similar apparent difficulty, fall just outside it — and you usually cannot tell which from the outside without expertise in the problem. So the verification burden is load-bearing, and it grows as you let the agent run longer unattended. The sober external anchor for all of this is NIST's CAISI evaluation, which placed the leading open model about eight months behind the closed frontier — a real, measured gap, neither dismissive nor triumphant.

5.4The hybrid most people actually run

In practice, the honest answer for many teams is not all-local or all-cloud but a split: run a local open model for the routine majority of coding work — the scoped edits, the boilerplate, the well-specified fixes — and escalate the genuinely hard, ambiguous, long-horizon tasks to a closed frontier agent. That keeps most of your code on your own hardware and most of your spend near zero, while reserving the expensive closed model for the slice where it still clearly wins. Sovereignty is not all-or-nothing; it is a dial, and the hybrid is where most of the dial sits today.

Check your understanding

Think of two coding tasks from your own work: one scoped, well-specified, and checkable by tests, and one open-ended, ambiguous, and spanning many files. Which would you trust a local open model to do, and which would you escalate? The line between them is your jagged frontier — and naming it is what lets you run the hybrid on purpose instead of discovering its edges by accident.

Chapter Six

When local wins

Reading Synthesis of Chapters 1–5; the first-run docs for Ollama and Aider; and Unit one's decision chapter, Run Your Own Model.

6.1A local coding loop, end to end

The shortest path from nothing to a private coding agent is a handful of commands. Pull an agent-tuned model that fits your hardware, point an agent at it, and you are coding with nothing leaving your machine:

# pull an agent-tuned coding model that fits one 24 GB GPU
ollama pull devstral-small:24b

# drive it with Aider — git-native, commits each change as a clean diff
pip install aider-chat
aider --model ollama/devstral-small:24b

# or a terminal agent: point OpenCode at the same local endpoint
OPENAI_BASE_URL=http://localhost:11434/v1 opencode

From nothing to a private, local coding agent.

Then run the same test Unit one ended on: disconnect from the network and give the agent a small, scoped task. If it reads your files, proposes an edit, and you can run your tests against it with no internet at all, you have a coding assistant that is entirely yours — both halves, on your hardware, your source never crossing out.

6.2The decision

When is that the right call rather than reaching for a closed agent? The same fork as Unit one, with one coding-specific turn. If your code is sovereign or regulated — it cannot leave — the decision is made, and you run local, full stop. If it can leave, then task-shape decides: scoped, routine, well-specified work favors a local open model; long-horizon, ambiguous, many-file work still favors a closed agent, or the hybrid. Cost rarely decides on its own, because self-hosting trades a per-token bill for a hardware-and-time bill — the open tokens are cheap, but the machine and the hours running it are not free, so price your own time honestly.

Figure 6. The coding decision, as two questions. Sovereignty answers first and absolutely — regulated or private code stays local and the cost comparison never happens. Only when the code can leave does task-shape decide: scoped and checkable favors a local open model; long-horizon and ambiguous still favors a closed agent or the hybrid.

6.3Own the tools

That is the second floor of the cluster. You can now read a coding model's license and benchmarks with a clear eye, drive it with an open agent you control, keep the whole loop on your own hardware, and judge honestly when to do so. Unit one was own the model; this was own the tools. The arc closes in Unit three, own the stack, which reaches past text into the open models for images, audio, and embeddings — on the same hardware, with the same sovereignty, and the same honest accounting of where open is ready and where it is not. The thread through all three is unchanged: own your tools rather than rent them, and keep the authority over your work where it belongs.

Practice · put it together

Take one real task from your week and run it through the whole unit: pick a model from Chapter 2 sized to your hardware from Chapter 4, drive it with an agent from Chapter 3, walk it through the Chapter 6 decision, and name the one verification from Chapter 5 you would never skip. If you reach the end and the honest answer is "this one should go to a closed agent," that is a correct outcome — knowing which tasks are not yet local work is exactly the judgment this unit was for.

Open Models That CodeJune 2026

The two halves

1.1Two pieces, not one

1.2The endpoint that joins them

1.3Why this is the sovereign move

The open coding models

2.1The roster

2.2Read the license, not the leaderboard

2.3Why these are runnable at all

2.4Reading the benchmarks

The harnesses

3.1One endpoint, many faces

3.2The four to know

3.3The sovereignty dividend

The sovereign stack

4.1The pattern that works

4.2The hardware reality

4.3The friction, honestly

The honest gap

5.1Where open is competitive, and where it lags

5.2What the benchmarks hide

5.3The reconciliation

5.4The hybrid most people actually run

When local wins

6.1A local coding loop, end to end

6.2The decision

6.3Own the tools

Glossary