Run Your Own Model

Chapter One

What open buys you

Reading Open Source Initiative, The Open Source AI Definition; G7, Vision on AI openness (June 2026); Epoch AI, Open models lag closed models by 4 months.

1.1Open-weight is not open-source

Start with the word, because almost every confusion downstream begins here. "Open" hides two different questions. The first: can you download the finished model and run it? The second: can you see and rebuild everything that produced it — the training code, the data, the recipe? A model can answer yes to the first and no to the second, and most do.

The numbers that make a model work are its weights. A model that ships its weights under a license you can use is an open-weight model — you can download it, run it, often fine-tune it. That is what most people mean, loosely, by "open source AI." But the Open Source Initiative, the body that has defined "open source" for software since 1998, draws a stricter line for AI. Its Open Source AI Definition asks for the weights and the training code and enough information about the training data to recreate a substantially equivalent model. By that bar, few models qualify — the genuinely open ones are a short list, and the famous "open" models are not on it.

The clearest example of the strict kind is AI2's OLMo. The third generation, released in late 2025, ships not just weights but the full pipeline: the training code, the Dolma 3 data corpus and its sub-mixes, intermediate checkpoints, training logs, and a tool called OlmoTrace that lets you click any sentence the model produced and see the training documents it most likely came from — an auditable path from raw data to deployed model, under a plain Apache 2.0 license. That is what "open source" means when taken literally. It is rare because it is expensive and legally fraught to release the data, which is exactly the fight the next section is about.

1.2The spectrum, and the licenses that bite

In June 2026 the G7's technology ministers published a short paper trying to fix the vocabulary, and its central claim is the one to carry forward: the openness of an AI is not binary. It exists on a spectrum, and the paper sets out four tiers, ordered from most open to least: Open Source AI with Open Data (everything open, including the full training data); Open Source AI (weights, deployment code, and training code, with data information standing in where the data itself cannot be shared); Open Weights AI (weights and the code to run them, under an open license, and nothing more); and Weights Available AI (weights you can download, but under a license carrying restrictions — commercial, geographic, or use-based — that disqualify it from being called open source at all).

That bottom tier is where some of the most famous "open" models actually sit, and the license is where you find out. Meta's Llama is the cautionary specimen. Its community license reads as permissive until you reach the clauses: a company whose products exceed 700 million monthly active users must request a separate license from Meta, granted at Meta's sole discretion; its acceptable-use policy withholds the rights to the multimodal models from anyone domiciled in the European Union; and anything you build must carry "Built with Llama" branding. Useful, widely run, genuinely capable — and, by the OSI's definition and Meta's own framing, source-available rather than open source.

Contrast that with the genuinely permissive licenses, which is most of the competitive field now: DeepSeek under MIT, Qwen and Mistral under Apache 2.0, Z.ai's GLM under MIT. These carry no user cap, no geographic ban, no branding tax — you can run, modify, and ship without asking. The practical lesson is simple and worth making a habit: before you build on a model, read its license, not its marketing. The word "open" on the download page tells you almost nothing; the license file tells you everything.

1.3How far behind, really

Here is the claim the whole unit rests on, and it is measurable. Epoch AI tracks the gap between the best open-weight models and the best closed models on a common capability index. Since January 2026 that gap has averaged about four months, or roughly eight index points — about the distance between GPT-5 and GPT-5.5. And it has held at that rough width for more than two years, even as the frontier itself accelerated. Open is not catching up in fits and starts; it is trailing at a steady, short distance.

Figure 1. The closing gap, drawn from Epoch AI's capability-index data. The open line tracks the closed frontier at a roughly four-month delay — reaching each level of capability a season after the closed models do, and holding that distance as both climb. The lag is the story, not any single model's score.

Two honest caveats keep this from being hype. The gap depends on how you measure: on public benchmarks it runs about four to six months, but on contamination-resistant private benchmarks — the harder, fairer test — it widens to eight to ten. And it varies over time, occasionally closing entirely before the next closed release reopens it. So the accurate sentence is not "open has won," but "open is roughly one season behind, and the season has stayed about the same length for years."

Where open does not merely catch up but wins outright is cost. Take the model we will use as a running example: Z.ai's GLM-5.2, a frontier-class open-weight coding model. Its list price is $1.40 per million input tokens and $4.40 per million output, against Claude Opus 4.8 at $5 and $25* — roughly four to six times cheaper per token, about one-sixth the cost, for a model a short step behind. Reach for a cheaper, smaller open model and the gap against the flagship widens to thirty times or more, though you are no longer comparing like for like on capability. And the moment you self-host, the per-token cost disappears entirely — you pay for the hardware and the power, nothing per question. That is the lever the closed labs cannot match, and the reason the premium on API access faces long-term pressure.

*List prices as of June 2026; model API prices change often — verify against the provider's current pricing page before you quote them.

Check your understanding: read the chart, not the bars

Find a recent benchmark chart a lab published for its own model. Before believing it, ask four questions. Whose logo is on it — is the lab grading its own homework? Did the lab choose which benchmarks appear? What do the footnotes say — "maximum effort," "with tools," a setting you would never use by default? And has anyone independent reproduced the numbers yet? A vendor's bars are a marketing claim until a neutral party confirms them. The capability gap that holds up to scrutiny is the one measured across many models by someone with nothing to sell.

Chapter Two

How a model is shaped

Reading Model cards on Hugging Face; the llama.cpp and vLLM documentation on quantization, context, and sampling. Confirm any specific number against the card for the exact model you run.

2.1Parameters: size, and the trick of "active"

A model's parameters are the weights — the trained numbers — and their count is the headline size, quoted in billions: an 8B model, a 70B model. More parameters generally means more capability and more memory, in rough proportion. That is the first dial, and for a long time it was the whole story.

The complication is the mixture-of-experts design, now standard at the frontier. A mixture-of-experts model quotes two numbers: total parameters and active parameters. GLM-5.2 is about 744 billion total but activates only around 40 billion per token; DeepSeek's flagship is 671 billion total, 37 billion active. A routing layer picks a small subset of "experts" for each token, so the model has the knowledge of the large number but the per-token compute cost of the small one. The practical consequence you must hold onto: the total parameter count governs how much memory the model needs to load — you still have to fit all of it — while the active count governs how fast it runs once loaded. A huge mixture-of-experts model can be quick per token and still demand an enormous amount of memory to hold.

2.2Quantization: trading precision for room

A weight is a number, and a number can be stored at different precisions. Models are trained at sixteen bits per weight, but you rarely run them that way. Quantization compresses each weight to fewer bits — eight, or four, sometimes fewer — shrinking the model so it fits on hardware you can afford. The common notation you will see, like Q4_K_M, names the bit depth and the method; four-bit K_M is the sweet spot most local users settle on, the smallest size that holds quality for most work.

The rule of thumb that makes this concrete: the model's file size on disk is approximately the memory it needs to run. A 70B model at sixteen bits is about 140 gigabytes; at eight bits, about 70; at four bits, about 35. That last number is why a four-bit 70B fits on a pair of consumer cards while the full-precision version needs a server.

Figure 2. One 70-billion-parameter model, by precision. Each step halves the memory it needs; the file size on disk is roughly the memory to run it. The dashed line is a 48 GB build (two used cards) — the four-bit version fits under it, eight-bit does not, and full precision needs a server. Quantization is what brings a big model down onto hardware you own.

Does the compression hurt? Yes, but less than you would fear, and unevenly. Four-bit quantization costs a little quality, and on most tasks it is hard to notice; the loss is real but task-dependent and often invisible to generic benchmarks, which is exactly why you should test on your task rather than trust a leaderboard. Push below four bits and the degradation steepens quickly. The honest framing: quantization is how a big model reaches your hardware, and the price is a quality margin you should measure, not assume.

2.3Context and temperature: memory and nerve

Two more dials and you have the working set. The context window is how much the model can read at once — 128,000 tokens, sometimes a million. It sounds like pure upside, but it has a memory cost: the model keeps a running cache (the "KV cache") whose size grows with the length of the context, and on local hardware that cache competes with the model itself for room. A long-context, high-throughput workload can run you out of memory even when the model fits.

The last dial is temperature, with its companions like top-p and min-p — the controls on randomness. Low temperature makes the model pick the likeliest next word, which you want for code and factual work; higher temperature loosens it, which you want for brainstorming and prose. One discipline matters more than the exact numbers: every model expects a particular chat format and a particular set of default settings, published on its model card, and if you ignore them the output degrades or malforms in ways that look like the model is broken when really you are holding it wrong. Read the card, match its template, start from its defaults.

Practice: size the model to the machine

Pick a model you would like to run and find its card. Note its total parameter count, and whether it is a mixture-of-experts model with a smaller active count. Estimate its size at four-bit quantization (roughly half the parameter count in billions, as gigabytes — a 70B is about 35 GB). Then compare that to the memory on the machine you actually have. If the four-bit size exceeds your memory, you have three honest options: a smaller model, a more aggressive quantization, or different hardware — which is the whole of the next chapter.

Chapter Three

Why local at all

Reading The privacy documentation and offline guides for Ollama and LM Studio; the Global Indigenous Data Alliance on the CARE Principles; the Collaboratory for Indigenous Data Governance's 2026 work on local models and data sovereignty.

3.1Your words never leave the box

The plainest reason to run a model locally is the one that survives every other argument: the data never leaves your machine. When you call a hosted model, your prompt, your documents, and the model's answer all travel to a third party whose data-retention policy now applies to your words. Run the model on your own hardware and that entire category of concern vanishes — no tokens sent, no external retention policy, no multi-party data agreement to negotiate. For a casual user that is a nicety. For anyone working with a sensitive codebase, patient records, student data, or legally privileged material, it is not a preference, it is a requirement.

Figure 3. Where your prompt goes. Run the model locally (left) and the whole loop — prompt, model, answer — stays inside the boundary of your machine; nothing crosses. Call a hosted API (right) and your words cross out to the provider's servers and back. Same task, two very different data paths.

3.2"Local" is not automatically private

The word "local" promises more than it always delivers, and the gap is where people get burned. The model's inference may run on your device while the application around it still checks for updates, downloads models, syncs settings, or quietly sends usage analytics. So treat "local" as a claim to verify, with three questions. Does the model license restrict what you can do — the Chapter 1 question. Does the tool phone home? And is this "local" app actually running locally?

That third question is not hypothetical. Ollama, the most popular local runner, sends no telemetry by default and, once a model is pulled, runs fully offline — packet captures confirm zero outbound traffic. But it also offers cloud models, tagged with a -cloud suffix, that run on Ollama's servers rather than your machine and look almost identical in the interface; you can think you are local and be hitting a data center. Separately, its local API binds to your machine alone by default, but a single setting (OLLAMA_HOST=0.0.0.0) exposes it to your whole network without a password. LM Studio takes the opposite shape: its privacy policy is so strict it holds no user data at all — it cannot honor a deletion request because there is nothing to delete — but it is a closed-source binary, so you are trusting the company's word rather than auditing the code. The lesson is not that any one tool is unsafe; it is that "local" is a setting and a habit, not a guarantee. Verify the telemetry, check the network exposure, and confirm the model you are running is the one on your disk.

3.3Sovereignty is authority, not access

Underneath privacy sits the idea that gives this whole unit its spine, and it is worth stating precisely because the word "open" is doing double duty across this guide. In the software world, "open" means access — you can get it, use it, reuse it. But there is an older, harder claim that openness can run against. Indigenous data governance frames it through the CARE Principles — Collective benefit, Authority to control, Responsibility, Ethics — which arose specifically in response to "open data" movements that prioritized access and reuse without regard for who holds the authority to decide. The same word, "open," pulling in opposite directions: open as in anyone-may-take, versus sovereign as in we-decide.

Running a model locally is where that abstraction becomes a tool. It is not a coincidence that the Collaboratory for Indigenous Data Governance ran a 2026 session titled "AI on Your Terms: Running Local Models for Data Sovereignty" — local models are how a community can use AI without surrendering the authority over its own data that the cloud arrangement quietly takes. And here is the part that matters for everyone, not only for sovereign nations: sovereignty is a property of where the model runs, not of how much you spent. A small open model on a four-hundred-dollar mini-PC keeps your data inside your walls exactly as completely as a frontier model on a ten-thousand-dollar workstation. The expensive thing is capability at the very top of the frontier. Authority over your own data is available at the bottom of the market — which is the reason this guide opens the hardware chapter at the cheap floor and climbs, rather than the other way around.

Practice: find your line

Name one kind of data you work with that you would not be comfortable sending to a third-party server — and one kind you would. The line between them is your real reason for caring about local inference; everyone's sits in a different place. Then, for the data on the wrong side of that line, ask the three questions of any tool you would use to handle it: what its license permits, whether it phones home, and whether "local" really means local. If you cannot answer all three from the tool's own documentation, that is the tool to be most careful with.

Chapter Four

The hardware question

Reading Manufacturer spec pages (NVIDIA DGX Spark, Apple Mac Studio) and community benchmarks (the r/LocalLLaMA hardware threads) for the live numbers; Epoch AI on frontier capability reaching consumer hardware. Verify any spec against the manufacturer's own page before you buy.

4.1The one model that doesn't rot

The specific cards will be obsolete by the time you read this; the way to think about them will not. Two numbers decide a local-LLM machine. The first is memory capacity — how much fast memory it has, whether that's a graphics card's VRAM or a Mac's unified memory — because capacity decides what fits: if the model plus its context cache won't fit in fast memory, it won't run there at all. The second is memory bandwidth — how fast that memory can be read — because bandwidth decides how fast the model feels, since producing each token means streaming the whole active model through the processor once. Capacity sets the ceiling on size; bandwidth sets the speed within it.

And there is one cliff to respect. When a model spills out of fast memory into ordinary system RAM, throughput does not degrade gently — it collapses, from a brisk conversation to something closer to reading a telegram through a straw. Fitting the model in fast memory is not a nice-to-have; it is the line between usable and useless.

4.2Climb from the floor

Because capacity is the gate, the honest way to choose is to start at the bottom and climb only as far as your models actually require. The rungs, cheapest first:

Tier	Fast memory	Runs	Rough cost	The catch
Use what you own	8–16 GB laptop or desktop	a 7–8B model, CPU-only	$0	slow-ish but real; 8 GB is the practical floor
Cheapest dedicated	Pi 5, a used 12 GB card, or a mini-PC	1–3B on a Pi, up to 7–8B on the card	~$60–650	small models only; a used 12 GB card runs an 8B near reading speed
Build-your-own Linux rig	2× used 24 GB cards = 48 GB	a 70B at 4-bit	~$1,200	you assemble and maintain it; Linux; speed needs the right serving software
Single consumer GPU	one 32 GB card	up to ~34B	~$2–3k	a 70B won't fit; fast within its ceiling
Apple unified memory	24 GB to 512 GB	7B up to the largest single-box models	~$1.4k–10k	no CUDA stack; the bandwidth leader at the top
NVIDIA DGX Spark	128 GB unified	a 70B at 4-bit	~$4k	CUDA-native; strong compute but modest bandwidth
Don't buy	—	—	by the hour	occasional use — rent, don't own

Figure 4. The capacity ladder. Each rung is a larger class of model you can hold, not a faster one — speed is bandwidth, a separate axis. Climb only as far as your models need. The most important rung is the bottom: sovereignty is present at every step, from a sixty-dollar Pi to a ten-thousand-dollar workstation.

The most important rung is the bottom one, and it is the reason this guide climbs rather than descends. A small open model on a Raspberry Pi keeps your data inside your walls exactly as completely as a frontier model on a workstation does. The expensive thing is capability at the very top; authority over your own data is available for the price of a single board.

For everyone between those extremes, the best capability-per-dollar is not any polished prosumer box — it is a self-built Linux machine with two used 24 GB cards, around $1,200 for 48 GB of memory, enough to run a 70B model at four-bit. The price of that value is your own labor: you assemble it, you run Linux, you manage power and heat, and raw capacity scales cheaply while speed needs proper serving software — which is the next chapter.

Practice: size your shortlist to a tier

Take the two or three models you would actually want to run and estimate each one's four-bit size (roughly half the parameter count in billions, as gigabytes). Then find the cheapest tier in the table above whose fast memory holds your largest one. If nothing affordable holds it, you have learned something useful before spending a dollar: either a smaller model does your job, or your job genuinely needs the bigger machine. Decide which on purpose, not by accident at checkout.

Chapter Five

Standing one up

Reading The documentation for Ollama, LM Studio, llama.cpp, Open WebUI, and vLLM.

5.1One engine, several faces

The tools trip up newcomers because they overlap, but they stack in a clear order. At the base sits llama.cpp, the C++ engine that actually runs quantized models on CPUs and consumer GPUs — almost everything else is a friendlier face over it. Ollama wraps that engine in a single command-line tool with a local server and an API that speaks the same protocol as the big cloud providers, which makes it the default for anyone comfortable in a terminal or wiring a model into an application. LM Studio is the graphical equivalent: a desktop app with a model browser, a chat window, and the same local server behind a button, friendlier if you would rather not touch a command line. Open WebUI puts a polished browser chat interface in front of either. And when you need to serve a model to several people or a production application at real throughput, vLLM is the serving engine built for that job, with the batching and memory techniques the lighter tools leave out.

Figure 5. One engine, several faces. llama.cpp does the actual running; Ollama and LM Studio are the two common ways to drive it; Open WebUI is a browser front end, and vLLM is the engine to reach for when you're serving a model to many users at once. Pick the face that fits how you work — they all run the same models.

5.2The first run

The shortest path from nothing to a running model is two commands. With Ollama, you install it, then pull and run a small model in one step:

# install Ollama on macOS or Linux, then pull and run a small model
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.2:8b

# it now serves an API at localhost:11434 that speaks the OpenAI protocol —
# point any compatible client at that URL and change the model name

From nothing to a local model.

Then do the one test that proves it is yours: disconnect from the network and ask it something. If it answers, the model is running on your machine and nothing is leaving it. While you are there, apply the habits from Chapter 3 — confirm the tool's telemetry is off, make sure the model name has no -cloud suffix, and don't expose the local server to your network unless you mean to. The model that fits your hardware, runs offline, and keeps its port to itself is the one you actually control.

Lab: stand one up and pull the plug

Install a runner — Ollama if you like a terminal, LM Studio if you prefer a window — and pull a model sized to the tier you landed on in Chapter 4. Have a short conversation. Then turn off your wifi and have another one. That second conversation, happening with no internet at all, is the whole point of this unit made concrete: a capable model answering your questions with nothing crossing the boundary of your machine.

Chapter Six

When local wins

Reading Synthesis of Chapters 1–5; total-cost-of-ownership analyses, treated as heuristics rather than fixed thresholds; and the companion guide Who the Agent Rewards on the jagged frontier.

6.1The honest economics

For all the sovereignty argument, most people first ask whether local is cheaper, and the honest answer is: it depends on volume, and on whether you count your own time. As a rough rule, below something like a billion tokens a day a hosted API usually wins once you price the hours you would spend running and maintaining your own machine. Self-hosting earns its keep in two situations: at genuine scale, where the per-token bill dwarfs the hardware, or under a sovereignty or compliance requirement, where the data simply cannot leave and the question of price never comes up. The decision is not "local is cheaper" or "the cloud is cheaper" — it is a fork with two honest branches.

Figure 6. The decision, as two questions. Sovereignty answers first and absolutely — if the data can't leave, the price comparison never happens. Only when it can leave does cost decide, and then volume is the hinge. Most workloads that aren't sovereign and aren't high-volume are genuinely better served by an API; saying so is what keeps the rest of this guide honest.

6.2The showcase, fairly

So here is the claim, stated as plainly and as honestly as the evidence allows. Open models now trail the closed frontier by about a season, run at roughly a fifth to a sixth of the per-token cost, and — once self-hosted — keep your data entirely in your hands. That is not "open beats the best closed model at everything." On the hardest tasks, and on the contamination-resistant benchmarks that test them, the gap is real and sometimes wide, and which side of the line a given task falls on is not always obvious from the outside — the jagged frontier that the companion guide, Who the Agent Rewards, is entirely about. The honest pitch is the one worth repeating: frontier-quality work, a short step behind, on hardware you own, at a fraction of the price, with your data never leaving the building. For a great deal of real work, that trade is not just acceptable — it is the better deal.

6.3Own it, don't rent it

This is the first of three units, and it was the foundation: you can now read a license, understand what a model needs and how it behaves, run one on your own hardware, and judge when doing so is the right call. The arc continues from here — own the model, then own the tools, then own the stack. Unit two puts the model you just stood up to work inside an agentic coding loop, comparing the open command-line tools that drive it. Unit three reaches past text into images, audio, and the rest of the open ecosystem, on the same hardware. The thread running through all three is the one this unit began with: own your tools rather than rent them, one floor at a time, and keep the authority over your own work where it belongs — with you.

Practice: make the call

Take one workload you actually have and walk it through the figure. Is its data sovereign or regulated? If so, you are done — it stays local, and cost never entered the picture. If not, is it high-volume or always-on enough that the per-token bill would outrun a one-time machine and your time to run it? Write down the answer and the one fact that decided it. That single sentence is the difference between choosing local on purpose and adopting it as a fashion — and choosing on purpose is the entire point.

Run Your Own ModelJune 2026