Learn/Open Models That Code
Equity & Sovereignty

Open Models That CodeJune 2026

Pair an open model with an open coding agent and you can run a capable engineering assistant on your own hardware — codebase never leaving the building. What the models are, what drives them, and where they still fall short of the closed frontier.

This is the second of three units on running open models on your own hardware: own the model, own the tools, own the stack. Unit one stood up a local model for chat; this one puts a model to work writing code, inside an agent you also control. It draws on the labs' own model cards and licenses, the documentation of the open coding agents, independent benchmark trackers and a NIST evaluation, and peer-reviewed work on what the coding benchmarks actually measure. The field moves monthly, so every model, score, and price is dated; treat them as true as of June 2026 and check the primary before you quote them.

In Plain Terms

In the first unit, running your own model meant having a private conversation with an AI on your own computer. This unit is about putting that same idea to work on code — letting a model you host read your codebase, write changes, run the tests, and fix what broke, all without your source ever leaving your machine.

Doing that takes two separate pieces, and keeping them separate is the key to the whole thing. The first is the model — the trained weights, the raw intelligence, the same kind of open model from Unit one but tuned for code. The second is the harness, also called the agent: a program that drives the model in a loop. The model on its own only answers questions; the harness is what turns answers into action — it decides which files to open, feeds them to the model, takes the model's proposed edit, applies it, runs the tests, and loops back if something failed. The model is the engine; the harness is the driver.

The reason they stay separate is a small piece of plumbing called an OpenAI-compatible endpoint — a common format, like a standard wall socket, that almost every local model server speaks and almost every open agent knows how to plug into. Because of that socket, you can pair any open model with any open agent and swap either one without rewiring the other. The same local server from Unit one that answered your chat questions can now sit behind a coding agent instead.

Two honest truths frame the rest. The good news is that the open coding models have come a remarkably long way: the best of them now sit a short step behind the best closed coding agents, at a fraction of the cost, and you can download and run them. The sobering news is that "a short step behind" is real and uneven — open models still trail on the longest, most ambiguous engineering work, and the benchmark scores everyone quotes turn out to overstate how good any of these models, open or closed, really are. So this unit is both a build guide and a reality check. By the end you'll know which open models are worth running, which agent to drive them with, how to keep the whole loop on your own hardware, and — just as important — where that loop will let you down.

Chapter One

The two halves


Reading The documentation for Aider, OpenCode, Cline, and Goose on connecting a local or OpenAI-compatible model; and Unit one, Run Your Own Model.

1.1Two pieces, not one

An agentic coding setup is two separable parts, and almost every confusion about "local coding agents" comes from collapsing them into one. The first part is the model: the open weights you met in Unit one, here tuned for code. A model, by itself, is a function from text to text — it answers, and stops. The second part is the harness, or agent: the program that wraps the model in a loop and turns answers into actions. It reads your repository, decides which files matter, sends them to the model with your instruction, takes the diff the model proposes, applies it, runs the tests, and — if something failed — feeds the failure back and goes around again. The model supplies the judgment about what to write; the harness supplies the hands, the eyes, and the persistence.

If that loop sounds familiar, it is the outer loop from the Loop Engineering guide, made concrete for code: a trigger, a prompt, a verification step, and a decision to continue or stop. The model is one component inside it. Which means the quality of your coding agent is not set by the model alone — it is set at least as much by the harness around it: how it slices context, whether it forces a diff, whether something in the loop can say no.

1.2The endpoint that joins them

What lets you treat the two halves as independent is a piece of plumbing: the OpenAI-compatible endpoint. When a chat API became widely adopted, its request-and-response format became a de facto standard, and the open ecosystem standardized on it. Ollama exposes one (at http://localhost:11434/v1); so do LM Studio, vLLM, and SGLang. And the open coding agents are built to point at any endpoint speaking that format. The practical result is a clean seam: the harness sends a standard request, the model server answers in the standard shape, and neither needs to know anything specific about the other.

1.3Why this is the sovereign move

Put the two halves on your own hardware and connect them over that local socket, and you have something the cloud arrangement can't offer: a coding assistant whose every input — your prompt, your files, your secrets, the model's edits — stays inside your machine. For a hobby project that is a nicety. For a team under a data-residency rule, a regulated industry, or anyone holding source that is not theirs to send away, it is the requirement. And because both halves are open, there is a second kind of sovereignty underneath the first: no vendor can switch your stack off, change its price, or retire it from under you. An open agent under a permissive license can be forked; a closed one is a dependency you don't control. We will see in Chapter 3 that this is not hypothetical.

Check your understanding

Take one coding task you'd want help with and name the two halves you'd use for it: which model, and which harness. If you can't yet name them, that's fine — the next two chapters are exactly those two lists. The point to hold now is that they're separate choices, joined by a standard socket, and you get to make each one independently.

Chapter Two

The open coding models


Reading Z.ai's GLM-5.2 docs; the Hugging Face model cards for MiniMax M3, DeepSeek V4, Kimi K2, and Qwen3-Coder; the independent llm-stats tracker and the NIST CAISI evaluation of DeepSeek V4.

2.1The roster

A handful of open-weight models are genuinely strong at agentic coding as of mid-2026, almost all of them from Chinese labs, almost all mixture-of-experts designs. The short list worth knowing:

ModelParams (total / active)LicenseA coding number*
GLM-5.2 (Z.ai)744B / 40BMITTerminal-Bench 2.1 81.0; SWE-bench Pro 62.1
DeepSeek V4-Pro1.6T / 49BMITSWE-bench Verified 80.6 (independently tracked)
Kimi K2.x (Moonshot)1T / 32BModified MITSWE-bench Pro 58.6 (vendor)
MiniMax M3428B / 23Bcustom (see §2.2)SWE-bench Pro 59.0 (vendor)
Qwen3-Coder (Alibaba)up to 480B / 35BApache 2.0the dependable self-host default
Devstral Small (Mistral)~24B denseApache 2.0runs on one 24 GB GPU

*Scores as of June 2026, and most are vendor-reported — see §2.4. The field moves monthly; verify against the primary before quoting.

GLM-5.2 is the one most practitioners reach for first: on Z.ai's own published numbers it is the strongest open-source coding model, scoring 81.0 on Terminal-Bench 2.1 — within a few points of Claude Opus 4.8 — and 62.1 on SWE-bench Pro, ahead of GPT-5.5. DeepSeek V4-Pro is its closest rival and has the distinction of an independently tracked 80.6 on SWE-bench Verified, the highest open-weight entry. The rest trade blows depending on the benchmark — which is the whole problem with benchmarks, and the subject of §2.4.

2.2Read the license, not the leaderboard

Unit one's hardest-won lesson applies double here: the word "open" on a download page tells you almost nothing, and the license file tells you everything. Most of this roster is genuinely permissive — GLM-5.2 and DeepSeek under MIT, Qwen3-Coder and Devstral under Apache 2.0, Kimi under a Modified MIT that allows commercial use with attribution. But MiniMax M3 is the cautionary specimen. Its weights are downloadable, but its Hugging Face card carries a custom "MiniMax" license, not MIT — and following the precedent of its immediate predecessor, commercial use may require written authorization. By the spectrum from Unit one, that places M3 in the Weights-Available tier, not open source, despite the open-weight billing. The model may still be the right tool; the point is that you only learn the constraint by reading the license, never the announcement.

2.3Why these are runnable at all

A 1.6-trillion-parameter model sounds impossible to self-host, and a dense one would be. The reason this roster is runnable is the mixture-of-experts design from Unit one: each model quotes a huge total and a small active count, and a router uses only the active fraction per token. The figure above is the whole point — totals range from Devstral's 24 billion to DeepSeek's 1.6 trillion, a sixty-fold spread, while the active counts cluster tightly between about 20 and 50 billion. Active governs the per-token compute (the speed); total governs how much memory you must hold (the fit). So a frontier open coder needs a lot of memory but is quicker per token than its size suggests — and Devstral Small, a dense 24B, is the floor that runs on a single consumer card. That gap between total and active is exactly what makes the hardware question of Chapter 4 tractable.

2.4Reading the benchmarks

Every coding number in this chapter is a claim until you know who measured it and how. Most headline scores — SWE-bench Pro, Terminal-Bench — are vendor-run, produced by the lab releasing the model, on a harness of its choosing. That doesn't make them false, but it makes them marketing until someone independent reproduces them. So prefer the independent sources: trackers like llm-stats that run the harness themselves, the official swebench.com leaderboard, and — best of all — a neutral evaluator with nothing to sell. On that last count, the most useful single primary in this whole space is NIST's CAISI evaluation, which put DeepSeek V4's capabilities about eight months behind the contemporaneous frontier on its own benchmark suite. Treat a vendor's bars the way Unit one treated a launch chart: a starting hypothesis, not a finding.

A wrinkle worth naming plainly: in February 2026 Anthropic publicly accused Moonshot — and named DeepSeek and MiniMax — of training their models in part on Claude's outputs through fraudulent accounts. These are accusations, not established findings, and they remain disputed. But they sit in the background of this roster, and they raise a real question for anyone building on these weights: sovereignty over a model you run locally does not settle where its capability originally came from.

Practice

Pick one model from the roster and open its actual Hugging Face card. Find three things: its total and active parameter counts, its license name (not the word "open" — the license), and whether the coding score it advertises was measured by the lab or by someone independent. If you can't find an independent number, that absence is itself the finding.

Chapter Three

The harnesses


Reading The repositories and docs for Aider, OpenCode, Cline, and Goose; license details cross-checked against each project's repo.

3.1One endpoint, many faces

The open coding agents look like a crowded, confusing field, but they sort cleanly once you see the shared seam. Every one worth running drives the model over the same OpenAI-compatible endpoint from Chapter 1, so the real differences are about where they live and how they work, not which models they support — they all support yours. The split that matters is terminal versus editor: some agents are command-line tools that work directly on your files and git history, and some are extensions that live inside your editor with access to its context.

3.2The four to know

Four open agents cover most needs, all under permissive licenses. Aider (Apache 2.0) is the git-native terminal tool: it commits every AI change as a clean, reviewable diff, which is exactly what you want on a serious codebase where the audit trail matters. OpenCode (MIT) is the breakout — the most-starred open coding agent on GitHub by a wide margin — a terminal interface that speaks to dozens of providers and any local endpoint. Cline (Apache 2.0) is the editor option, a VS Code and JetBrains extension with explicit plan-then-act modes and per-change approval, for people who want the agent inside their editor with the open files in view. Goose (Apache 2.0), from Block, is a CLI-and-desktop agent that now lives under the Linux Foundation's Agentic AI Foundation — a governance signal that the open-agent layer is consolidating around neutral ground. Pick by workflow: git-centric terminal work points to Aider, a provider-broad terminal to OpenCode, editor-embedded work to Cline.

3.3The sovereignty dividend

The reason the license on your agent matters as much as the license on your model became concrete in 2026, when Continue — a widely used open coding extension — was acquired by Cursor, its repository frozen to read-only with a final release and no further roadmap. For users of a closed tool, that is the end of the line. But Continue was Apache 2.0, which means the code can be read, forked, and carried forward by the community regardless of what the company does. That is the whole argument for an open stack in one event: a closed agent is a dependency someone else can retire, reprice, or lock; an open one under a permissive license cannot be switched off from outside. Own the tools, and no acquisition is your problem.

Practice

Install one of the four — Aider if you live in a terminal and git, Cline if you live in an editor — but don't point it at a model yet. Just read its documentation for the line that explains how to set a custom or local endpoint. Finding that one configuration setting is the whole trick of this unit: it's the seam where your Chapter 2 model plugs into your Chapter 3 agent.

Chapter Four

The sovereign stack


Reading The local-endpoint sections of the agent docs above; the vLLM and SGLang serving docs; and Unit one's hardware chapter, Run Your Own Model.

4.1The pattern that works

The fully-local coding stack is, mechanically, simple: serve an open coding model behind a local endpoint, then point an open agent at it. For the smaller models, Ollama is enough — pull the model and it serves the OpenAI-compatible endpoint automatically, exactly as in Unit one. For the large mixture-of-experts models, you step up to a purpose-built serving engine — vLLM or SGLang — which handle the batching, memory paging, and throughput that the big models need to be usable. Either way the shape is identical and the property is the one that matters: the agent talks to localhost, and your prompts, your context, and your code never cross the boundary of your machine.

4.2The hardware reality

Here honesty matters, because the coding models are big. The frontier open coders — GLM-5.2, DeepSeek V4, Kimi, MiniMax M3 — are the heavy end of Unit one's hardware ladder: their trillion-scale totals must fit in memory even though only tens of billions activate, so usable speed at long context means a multi-GPU server or a strong workstation, plus quantization. This is not laptop work for frontier quality. But the ladder has a real floor, and it is the equity story again: Devstral Small at 24B and a quantized Qwen3-Coder around 32B run a genuinely useful coding agent on a single 24 GB consumer GPU. You do not get GLM-5.2's long-horizon strength there — but you get a private, capable, daily-driver coding assistant for the cost of one used card, with your source staying home.

4.3The friction, honestly

Three frictions are worth naming before you commit, because the closed agents hide all of them. Speed: a long agent loop on a big local model is slower than the cloud, and that compounds across the dozens of turns a real task takes. Token burn: coding agents consume context fast — reading files, running tests, re-reading — and a long-context workload competes with the model for the same memory. Setup: serving a large MoE with vLLM is real systems work, well beyond Ollama's one-line pull. None of these is a dealbreaker, but together they are the price of sovereignty, and a guide that pretended otherwise would be selling you something.

Practice

Take the model you'd most like to code with and place it on Unit one's hardware ladder: estimate its four-bit size, and decide honestly whether it fits the machine you have. If it doesn't, drop to the floor — a 24B agent-tuned model on one card — and decide whether that's enough for your actual daily work before you spend anything. Choosing the model to the machine, on purpose, is the whole exercise.

Chapter Five

The honest gap


Reading Are "Solved Issues" in SWE-bench Really Solved Correctly? (ICSE 2026) and the SWE-ABS re-evaluation; the NIST CAISI evaluation; and the companion guide Who the Agent Rewards.

5.1Where open is competitive, and where it lags

The honest picture is neither "open has won" nor "open is a toy." On scoped work — a well-specified task, a bounded change, with tests to check it and a human reviewing — a top open model in a good harness is competitive with, and sometimes preferable to, a closed agent, at a fraction of the cost. Where open still trails the closed frontier is the hard, unbounded end: sustained planning over hundreds of steps, recovering gracefully from its own errors instead of piling on more, holding a large codebase in mind without perfect retrieval, and handling ambiguous or novel requirements without drifting. These are exactly the capabilities that separate a demo from a dependable engineer, and they are where the closed labs' post-training still shows.

5.2What the benchmarks hide

Before trusting any of the scores in Chapter 2, you have to know what they don't measure — and this is the most important thing in the unit. The benchmarks everyone quotes, SWE-bench and its Verified subset, judge a fix by whether it passes the repository's tests. But tests are not exhaustive, so a patch can pass them and still be wrong. A peer-reviewed study at ICSE 2026 examined this directly and found the test oracles weak enough to inflate reported resolution rates by several points; a follow-up re-evaluation that hardened the tests found that roughly one in five of the "solved" patches from the top agents were semantically incorrect — they passed only because the tests failed to catch their errors. When those bad patches were rejected, the top agent's score fell from 78.8% to 62.2%.

5.3The reconciliation

This is the same lesson the companion guide Who the Agent Rewards reaches from the other direction. A coding model is the input; the judgment about whether its output is actually right is supplied by the human and the harness, not the model. The jagged frontier means some coding tasks fall easily within an open model's reach and others, of similar apparent difficulty, fall just outside it — and you usually cannot tell which from the outside without expertise in the problem. So the verification burden is load-bearing, and it grows as you let the agent run longer unattended. The sober external anchor for all of this is NIST's CAISI evaluation, which placed the leading open model about eight months behind the closed frontier — a real, measured gap, neither dismissive nor triumphant.

5.4The hybrid most people actually run

In practice, the honest answer for many teams is not all-local or all-cloud but a split: run a local open model for the routine majority of coding work — the scoped edits, the boilerplate, the well-specified fixes — and escalate the genuinely hard, ambiguous, long-horizon tasks to a closed frontier agent. That keeps most of your code on your own hardware and most of your spend near zero, while reserving the expensive closed model for the slice where it still clearly wins. Sovereignty is not all-or-nothing; it is a dial, and the hybrid is where most of the dial sits today.

Check your understanding

Think of two coding tasks from your own work: one scoped, well-specified, and checkable by tests, and one open-ended, ambiguous, and spanning many files. Which would you trust a local open model to do, and which would you escalate? The line between them is your jagged frontier — and naming it is what lets you run the hybrid on purpose instead of discovering its edges by accident.

Chapter Six

When local wins


Reading Synthesis of Chapters 1–5; the first-run docs for Ollama and Aider; and Unit one's decision chapter, Run Your Own Model.

6.1A local coding loop, end to end

The shortest path from nothing to a private coding agent is a handful of commands. Pull an agent-tuned model that fits your hardware, point an agent at it, and you are coding with nothing leaving your machine:

# pull an agent-tuned coding model that fits one 24 GB GPU
ollama pull devstral-small:24b

# drive it with Aider — git-native, commits each change as a clean diff
pip install aider-chat
aider --model ollama/devstral-small:24b

# or a terminal agent: point OpenCode at the same local endpoint
OPENAI_BASE_URL=http://localhost:11434/v1 opencode
From nothing to a private, local coding agent.

Then run the same test Unit one ended on: disconnect from the network and give the agent a small, scoped task. If it reads your files, proposes an edit, and you can run your tests against it with no internet at all, you have a coding assistant that is entirely yours — both halves, on your hardware, your source never crossing out.

6.2The decision

When is that the right call rather than reaching for a closed agent? The same fork as Unit one, with one coding-specific turn. If your code is sovereign or regulated — it cannot leave — the decision is made, and you run local, full stop. If it can leave, then task-shape decides: scoped, routine, well-specified work favors a local open model; long-horizon, ambiguous, many-file work still favors a closed agent, or the hybrid. Cost rarely decides on its own, because self-hosting trades a per-token bill for a hardware-and-time bill — the open tokens are cheap, but the machine and the hours running it are not free, so price your own time honestly.

6.3Own the tools

That is the second floor of the cluster. You can now read a coding model's license and benchmarks with a clear eye, drive it with an open agent you control, keep the whole loop on your own hardware, and judge honestly when to do so. Unit one was own the model; this was own the tools. The arc closes in Unit three, own the stack, which reaches past text into the open models for images, audio, and embeddings — on the same hardware, with the same sovereignty, and the same honest accounting of where open is ready and where it is not. The thread through all three is unchanged: own your tools rather than rent them, and keep the authority over your work where it belongs.

Practice · put it together

Take one real task from your week and run it through the whole unit: pick a model from Chapter 2 sized to your hardware from Chapter 4, drive it with an agent from Chapter 3, walk it through the Chapter 6 decision, and name the one verification from Chapter 5 you would never skip. If you reach the end and the honest answer is "this one should go to a closed agent," that is a correct outcome — knowing which tasks are not yet local work is exactly the judgment this unit was for.

Reference

Glossary


Agentic coding
Using an AI model inside a loop that reads a codebase, makes edits, runs tests, and iterates — not just answering a question, but acting on a repository.
Harness / agent
The program that drives the model in that loop: it slices context, applies diffs, runs tests, and decides whether to continue. The model is the engine; the harness is the driver.
OpenAI-compatible endpoint
The de facto standard API shape that local model servers (Ollama, vLLM, LM Studio, SGLang) expose and open agents point at — the seam that makes model and harness independently swappable.
Mixture of experts (MoE)
A model design quoting total and active parameters; a router uses only the active fraction per token, so total governs memory-to-fit while active governs speed. Why trillion-scale coders are runnable.
SWE-bench / Verified
The standard benchmark for issue-fixing: an agent must patch a real repository so its tests pass. Verified is a human-filtered 500-issue subset. Both judge by tests, which is their weakness.
SWE-bench Pro / Terminal-Bench
Harder agentic benchmarks for longer-horizon and terminal-based coding. Most published scores are vendor-run unless an independent tracker reproduces them.
Weak test oracle
The flaw behind inflated scores: because a repository's tests aren't exhaustive, a patch can pass them and still be semantically wrong — so leaderboard numbers overstate real capability.
Model-agnostic / BYO-LLM
An agent that can run any model you supply via the OpenAI-compatible endpoint, rather than being locked to one provider. All four agents in this unit are this.
vLLM / SGLang
Serving engines built for high-throughput inference of large models, with the batching and memory paging the big MoE coders need — the step up from Ollama for frontier-scale weights.
Distillation
Training one model on another model's outputs. The basis of accusations, disputed, that several Chinese open labs trained on Claude — a reminder that running a model locally doesn't settle where its capability came from.
The hybrid
Running a local open model for routine coding and escalating hard, ambiguous, long-horizon tasks to a closed frontier agent. Where most sovereignty-minded teams actually sit today.
Jagged frontier
The uneven boundary of capability: some tasks fall within an open model's reach and others of similar apparent difficulty fall just outside, hard to tell apart without expertise. Carried over from Who the Agent Rewards.
Sources

Z.ai, GLM-5.2 documentation · MiniMax M3, DeepSeek V4, Kimi K2, and Qwen3-Coder model cards on Hugging Face · independent tracking via llm-stats and the SWE-bench leaderboard · NIST, CAISI Evaluation of DeepSeek V4 Pro (May 2026) · the Aider, OpenCode, Cline, and Goose projects · Are "Solved Issues" in SWE-bench Really Solved Correctly? (ICSE 2026) and the SWE-ABS re-evaluation · the companion guide Who the Agent Rewards.

An unofficial study text; model scores and prices are dated to June 2026 and move quickly — verify against the primary before quoting. Commentary synthesized in original wording; every figure is drawn from the cited primary research and was verified against it. Don't cite Urania or this guide — cite the primary sources above. Researched and drafted by Urania, an AI research system; edited, verified, and signed by Zach Rossmiller, who is accountable for what's published.