Run Your Own Model
June 2026

Open models have nearly caught the closed frontier — and you can run one on hardware you own, with your data never leaving the building. Here's what 'open' really means, how a model is built, and why local is the sovereign choice.

This is the first of three units on running open models on your own hardware: own the model, own the tools, own the stack. It is the foundation the other two build on, and its subject is text — chat and reasoning models — not image, audio, or the coding agents that come later. It draws on the Open Source Initiative's AI definition, a June 2026 G7 paper on what "open" should mean, capability-gap data from Epoch AI, the model cards and licenses of the labs themselves, and the published privacy documentation of the local-inference tools. Where a number could move — prices, benchmarks, the frontier — it was verified against a primary source and dated; treat all of it as true as of June 2026 and check before you quote it.

In Plain Terms

There is a story you have probably absorbed without noticing: that the good AI lives in someone else's data center, and you reach it by sending your words across the internet and waiting for an answer to come back. For a lot of work that is fine. But it is not the only arrangement, and for some work it is the wrong one.

An open model is one whose insides — the trained numbers that make it work — you are allowed to download and keep. Once you have them, you can run the model on your own computer. Your questions never leave the room. There is no bill per question, no rate limit, no company logging what you asked. The catch is that you supply the machine and the electricity, and you have to know enough to set it up.

Two things people assume about these models are both wrong, in opposite directions. The first wrong assumption is that "open" models are far behind the famous closed ones — toys next to the real thing. They are not. The best open models now trail the best closed ones by only about four months, and they cost a fraction as much to run. The second wrong assumption is that "open" is a single clear category, like a light switch that is either on or off. It is not. There is a wide spectrum between "you can download the finished model" and "you can see and rebuild everything that went into it," and the labels in between are contested enough that the world's richest governments spent a paper trying to sort them out.

The deeper reason to care is not price, and it is not even privacy in the everyday sense. It is authority — who decides. When the model runs on a machine you control, you are not asking anyone's permission, you are not subject to anyone's terms changing next quarter, and your data is not part of anyone else's business. For an individual that is convenience. For a clinic, a school, a tribal government, or anyone holding data that is not theirs to hand away, it is the whole point. A small model on a four-hundred-dollar box keeps your data exactly as much in your hands as a model on a ten-thousand-dollar one. Sovereignty turns out to be cheap; it is capability at the very top that costs.

This unit is the groundwork. By the end you will be able to read a model's license and place it honestly on the open-to-closed spectrum, understand the four dials — size, precision, memory, and randomness — that decide what a model needs and how it behaves, and reason clearly about when keeping the model on your own hardware is the right call rather than a hobbyist's affectation.

Chapter One

What open buys you


Reading: Open Source Initiative, The Open Source AI Definition; G7, Vision on AI openness (June 2026); Epoch AI, Open models lag closed models by 4 months.

1.1 Open-weight is not open-source

Start with the word, because almost every confusion downstream begins here. "Open" hides two different questions. The first: can you download the finished model and run it? The second: can you see and rebuild everything that produced it — the training code, the data, the recipe? A model can answer yes to the first and no to the second, and most do.

The numbers that make a model work are its weights. A model that ships its weights under a license you can use is an open-weight model — you can download it, run it, often fine-tune it. That is what most people mean, loosely, by "open source AI." But the Open Source Initiative, the body that has defined "open source" for software since 1998, draws a stricter line for AI. Its Open Source AI Definition asks for the weights and the training code and enough information about the training data to recreate a substantially equivalent model. By that bar, few models qualify — the genuinely open ones are a short list, and the famous "open" models are not on it.

The clearest example of the strict kind is AI2's OLMo. The third generation, released in late 2025, ships not just weights but the full pipeline: the training code, the Dolma 3 data corpus and its sub-mixes, intermediate checkpoints, training logs, and a tool called OlmoTrace that lets you click any sentence the model produced and see the training documents it most likely came from — an auditable path from raw data to deployed model, under a plain Apache 2.0 license. That is what "open source" means when taken literally. It is rare because it is expensive and legally fraught to release the data, which is exactly the fight the next section is about.

1.2 The spectrum, and the licenses that bite

In June 2026 the G7's technology ministers published a short paper trying to fix the vocabulary, and its central claim is the one to carry forward: the openness of an AI is not binary. It exists on a spectrum, and the paper sets out four tiers, ordered from most open to least: Open Source AI with Open Data (everything open, including the full training data); Open Source AI (weights, deployment code, and training code, with data information standing in where the data itself cannot be shared); Open Weights AI (weights and the code to run them, under an open license, and nothing more); and Weights Available AI (weights you can download, but under a license carrying restrictions — commercial, geographic, or use-based — that disqualify it from being called open source at all).

That bottom tier is where some of the most famous "open" models actually sit, and the license is where you find out. Meta's Llama is the cautionary specimen. Its community license reads as permissive until you reach the clauses: a company whose products exceed 700 million monthly active users must request a separate license from Meta, granted at Meta's sole discretion; its acceptable-use policy withholds the rights to the multimodal models from anyone domiciled in the European Union; and anything you build must carry "Built with Llama" branding. Useful, widely run, genuinely capable — and, by the OSI's definition and Meta's own framing, source-available rather than open source.

Contrast that with the genuinely permissive licenses, which is most of the competitive field now: DeepSeek under MIT, Qwen and Mistral under Apache 2.0, Z.ai's GLM under MIT. These carry no user cap, no geographic ban, no branding tax — you can run, modify, and ship without asking. The practical lesson is simple and worth making a habit: before you build on a model, read its license, not its marketing. The word "open" on the download page tells you almost nothing; the license file tells you everything.

1.3 How far behind, really

Here is the claim the whole unit rests on, and it is measurable. Epoch AI tracks the gap between the best open-weight models and the best closed models on a common capability index. Since January 2026 that gap has averaged about four months, or roughly eight index points — about the distance between GPT-5 and GPT-5.5. And it has held at that rough width for more than two years, even as the frontier itself accelerated. Open is not catching up in fits and starts; it is trailing at a steady, short distance.

Two honest caveats keep this from being hype. The gap depends on how you measure: on public benchmarks it runs about four to six months, but on contamination-resistant private benchmarks, the harder and fairer test, it widens to eight to ten months. And it varies over time, occasionally closing entirely before the next closed release reopens it. So the accurate sentence is not "open has won," but "open is roughly one season behind, and the season has stayed about the same length for years."

Where open does not merely catch up but wins outright is cost. Take the model we will use as a running example: Z.ai's GLM-5.2, a frontier-class open-weight coding model. Its list price is $1.40 per million input tokens and $4.40 per million output, against Claude Opus 4.8 at $5 and $25*. That is about 3.6 times cheaper on input and 5.7 times cheaper on output, for a model a short step behind. Reach for a cheaper, smaller open model and the gap against the flagship widens to thirty times or more, though you are no longer comparing like for like on capability. And the moment you self-host, the per-token cost disappears entirely: you pay for the hardware and the power, nothing per question. That is the lever the closed labs cannot match, and the reason the premium on API access faces long-term pressure.

*List prices as of June 2026; model API prices change often — verify against the provider's current pricing page before you quote them.
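The multiples above can be recomputed from the four quoted list prices. The sketch below is plain arithmetic on those June 2026 figures; the three-to-one input-to-output mix in the blended example is an assumption for illustration, not a measured workload.

```python
# List prices in USD per million tokens, as quoted in the text (June 2026).
PRICES = {
    "GLM-5.2": {"input": 1.40, "output": 4.40},
    "Claude Opus 4.8": {"input": 5.00, "output": 25.00},
}


def price_ratio(cheap: str, costly: str, kind: str) -> float:
    """How many times more the costly model charges for one kind of token."""
    return PRICES[costly][kind] / PRICES[cheap][kind]


def blended_cost(model: str, input_millions: float, output_millions: float) -> float:
    """Bill in USD for a workload measured in millions of tokens."""
    p = PRICES[model]
    return p["input"] * input_millions + p["output"] * output_millions


if __name__ == "__main__":
    for kind in ("input", "output"):
        ratio = price_ratio("GLM-5.2", "Claude Opus 4.8", kind)
        print(f"{kind}: {ratio:.1f}x")
    # Hypothetical workload: 3 million input tokens for every 1 million output.
    open_bill = blended_cost("GLM-5.2", 3, 1)
    closed_bill = blended_cost("Claude Opus 4.8", 3, 1)
    print(f"blended: ${open_bill:.2f} vs ${closed_bill:.2f}")
```

Because output tokens carry the larger multiple, the overall saving depends on how output-heavy your own workload is. Swap in your real mix before drawing a conclusion.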

Check your understanding: read the chart, not the bars

Find a recent benchmark chart a lab published for its own model. Before believing it, ask four questions. Whose logo is on it — is the lab grading its own homework? Did the lab choose which benchmarks appear? What do the footnotes say — "maximum effort," "with tools," a setting you would never use by default? And has anyone independent reproduced the numbers yet? A vendor's bars are a marketing claim until a neutral party confirms them. The capability gap that holds up to scrutiny is the one measured across many models by someone with nothing to sell.

Chapter Two

How a model is shaped


Reading: Model cards on Hugging Face; the llama.cpp and vLLM documentation on quantization, context, and sampling. Confirm any specific number against the card for the exact model you run.

2.1 Parameters: size, and the trick of "active"

A model's parameters are the weights — the trained numbers — and their count is the headline size, quoted in billions: an 8B model, a 70B model. More parameters generally means more capability and more memory, in rough proportion. That is the first dial, and for a long time it was the whole story.

The complication is the mixture-of-experts design, now standard at the frontier. A mixture-of-experts model quotes two numbers: total parameters and active parameters. GLM-5.2 is about 744 billion total but activates only around 40 billion per token; DeepSeek's flagship is 671 billion total, 37 billion active. A routing layer picks a small subset of "experts" for each token, so the model has the knowledge of the large number but the per-token compute cost of the small one. The practical consequence you must hold onto: the total parameter count governs how much memory the model needs to load — you still have to fit all of it — while the active count governs how fast it runs once loaded. A huge mixture-of-experts model can be quick per token and still demand an enormous amount of memory to hold.
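To make the routing step concrete, here is a toy version of top-k gating, a common way to pick experts. It is a generic sketch, not the actual router of GLM-5.2 or DeepSeek: the scores are made up, and real routers compute them from a learned projection and usually add load-balancing terms that are left out here.

```python
import math


def route(scores: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax their scores into weights."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest router scores, best first.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the chosen scores only, shifted by the max for stability.
    top = scores[chosen[0]]
    exps = [math.exp(scores[i] - top) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


if __name__ == "__main__":
    # Hypothetical router scores for one token across 8 experts.
    scores = [0.1, 2.0, -1.0, 0.5, 2.0, -0.3, 1.2, 0.0]
    picks = route(scores, k=2)
    print(picks)
    print(f"experts used: {len(picks)} of {len(scores)}")
```

Only the chosen experts do any arithmetic for this token, which is why the active count sets the speed. All eight still have to sit in memory, because the next token may be routed to different ones.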

2.2 Quantization: trading precision for room

A weight is a number, and a number can be stored at different precisions. Models are typically trained and published at sixteen bits per weight, but you rarely run them that way. Quantization compresses each weight to fewer bits (eight, or four, sometimes fewer), shrinking the model so it fits on hardware you can afford. The common notation you will see, like Q4_K_M, names the bit depth and the method; four-bit K_M is the sweet spot most local users settle on, the smallest size that holds quality for most work.

The rule of thumb that makes this concrete: the model's file size on disk is approximately the memory its weights need at run time, before any context cache is added. A 70B model at sixteen bits is about 140 gigabytes; at eight bits, about 70; at four bits, about 35. That last number is why a four-bit 70B fits on a pair of consumer cards while the full-precision version needs a server.
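The rule of thumb is one line of arithmetic: parameters times bits, divided by eight. The sketch below reproduces the 70B figures and applies the same rule to the mixture-of-experts numbers quoted earlier for GLM-5.2. Treat the results as a floor, since real four-bit formats store scaling data alongside the weights and their files run somewhat larger.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes: parameters x bits / 8."""
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    # Dense 70B model at the three precisions quoted in the text.
    for bits in (16, 8, 4):
        print(f"70B at {bits}-bit: ~{weights_gb(70, bits):.0f} GB")

    # Mixture-of-experts figures quoted earlier for GLM-5.2 (about 744B total,
    # about 40B active): total sets what must be held, active what is read per token.
    print(f"hold in memory at 4-bit: ~{weights_gb(744, 4):.0f} GB")
    print(f"read per token at 4-bit: ~{weights_gb(40, 4):.0f} GB")
```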

Does the compression hurt? Yes, but less than you would fear, and unevenly. Four-bit quantization costs a little quality, and on most tasks it is hard to notice; the loss is real but task-dependent and often invisible to generic benchmarks, which is exactly why you should test on your task rather than trust a leaderboard. Push below four bits and the degradation steepens quickly. The honest framing: quantization is how a big model reaches your hardware, and the price is a quality margin you should measure, not assume.

2.3 Context and temperature: memory and nerve

Two more dials and you have the working set. The context window is how much the model can read at once — 128,000 tokens, sometimes a million. It sounds like pure upside, but it has a memory cost: the model keeps a running cache (the "KV cache") whose size grows with the length of the context, and on local hardware that cache competes with the model itself for room. A long-context, high-throughput workload can run you out of memory even when the model fits.
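The cache cost can be estimated before you load anything. The sketch below uses the standard accounting for ordinary attention: one key vector and one value vector per token, per layer, per key-value head. The layer and head counts are hypothetical values for an 8B-class model, not taken from any particular card, and architectures that compress or window the cache will come in lower.

```python
def kv_cache_gb(
    layers: int,
    kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,
    sequences: int = 1,
) -> float:
    """Approximate KV-cache size in gigabytes for standard attention.

    Each token stores one key and one value vector per layer per KV head.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * sequences / 1e9


if __name__ == "__main__":
    # Hypothetical 8B-class shape; read the real values from the model's config.
    shape = dict(layers=32, kv_heads=8, head_dim=128)
    for tokens in (8_000, 32_000, 128_000):
        size = kv_cache_gb(context_tokens=tokens, **shape)
        print(f"{tokens:>7} tokens: ~{size:.1f} GB")
    # Four concurrent 32k-token conversations multiply the cache by four.
    busy = kv_cache_gb(context_tokens=32_000, sequences=4, **shape)
    print(f"4 x 32k: ~{busy:.1f} GB")
```

With these assumed numbers, a full 128,000-token context needs roughly four times the memory of the four-bit weights of an 8B model. That is how a model that fits can still run you out of room.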

The last dial is temperature, with its companions like top-p and min-p — the controls on randomness. Low temperature makes the model pick the likeliest next word, which you want for code and factual work; higher temperature loosens it, which you want for brainstorming and prose. One discipline matters more than the exact numbers: every model expects a particular chat format and a particular set of default settings, published on its model card, and if you ignore them the output degrades or malforms in ways that look like the model is broken when really you are holding it wrong. Read the card, match its template, start from its defaults.
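What the three randomness controls do is easier to see in code than in prose. This is a minimal sketch of temperature, min-p, and top-p applied to a short list of made-up scores. The order in which the filters run is a choice made here for readability; real runners order them differently and add more controls, so use it to build intuition rather than to predict a specific tool's output.

```python
import math
import random


def sample(logits, temperature=1.0, top_p=1.0, min_p=0.0, rng=random):
    """Draw one token index after temperature, min-p and top-p filtering.

    top_p and min_p are fractions between 0 and 1.
    """
    if temperature <= 0:
        # Treat zero temperature as greedy decoding: always the likeliest token.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature rescales the scores before the softmax; lower means sharper.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # min-p: drop tokens whose probability is below min_p times the best one.
    floor = min_p * max(probs)
    ranked = sorted(
        (i for i in range(len(probs)) if probs[i] >= floor),
        key=lambda i: probs[i],
        reverse=True,
    )

    # top-p: keep the smallest high-probability prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices renormalises the surviving weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


if __name__ == "__main__":
    logits = [4.0, 3.0, 1.0, 0.0]  # hypothetical scores for four tokens
    rng = random.Random(0)
    for t in (0.0, 0.7, 1.5):
        draws = [sample(logits, temperature=t, top_p=0.9, rng=rng) for _ in range(1000)]
        print(t, [draws.count(i) for i in range(len(logits))])
```

Running it shows the pattern the paragraph describes: at zero temperature every draw is the top token, and as temperature rises the lower-ranked tokens start to appear.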

Practice: size the model to the machine

Pick a model you would like to run and find its card. Note its total parameter count, and whether it is a mixture-of-experts model with a smaller active count. Estimate its size at four-bit quantization (roughly half the parameter count in billions, as gigabytes — a 70B is about 35 GB). Then compare that to the memory on the machine you actually have. If the four-bit size exceeds your memory, you have three honest options: a smaller model, a more aggressive quantization, or different hardware — which is the whole of the next chapter.

Chapter Three

Why local at all


Reading: The privacy documentation and offline guides for Ollama and LM Studio; the Global Indigenous Data Alliance on the CARE Principles; the Collaboratory for Indigenous Data Governance's 2026 work on local models and data sovereignty.

3.1 Your words never leave the box

The plainest reason to run a model locally is the one that survives every other argument: the data never leaves your machine. When you call a hosted model, your prompt, your documents, and the model's answer all travel to a third party whose data-retention policy now applies to your words. Run the model on your own hardware and that entire category of concern vanishes — no tokens sent, no external retention policy, no multi-party data agreement to negotiate. For a casual user that is a nicety. For anyone working with a sensitive codebase, patient records, student data, or legally privileged material, it is not a preference, it is a requirement.

3.2 "Local" is not automatically private

The word "local" promises more than it always delivers, and the gap is where people get burned. The model's inference may run on your device while the application around it still checks for updates, downloads models, syncs settings, or quietly sends usage analytics. So treat "local" as a claim to verify, with three questions. Does the model license restrict what you can do — the Chapter 1 question. Does the tool phone home? And is this "local" app actually running locally?

That third question is not hypothetical. Ollama, the most popular local runner, sends no telemetry by default and, once a model is pulled, runs fully offline — packet captures confirm zero outbound traffic. But it also offers cloud models, tagged with a -cloud suffix, that run on Ollama's servers rather than your machine and look almost identical in the interface; you can think you are local and be hitting a data center. Separately, its local API binds to your machine alone by default, but a single setting (OLLAMA_HOST=0.0.0.0) exposes it to your whole network without a password. LM Studio takes the opposite shape: its privacy policy is so strict it holds no user data at all — it cannot honor a deletion request because there is nothing to delete — but it is a closed-source binary, so you are trusting the company's word rather than auditing the code. The lesson is not that any one tool is unsafe; it is that "local" is a setting and a habit, not a guarantee. Verify the telemetry, check the network exposure, and confirm the model you are running is the one on your disk.
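Two of those checks are simple enough to script. The sketch below flags a model name that carries the cloud marker and an OLLAMA_HOST value that listens beyond the loopback address. Both functions are hypothetical helpers written for this guide, not part of Ollama. The host parsing is a cautious heuristic, and it reads the environment of the current process, which may differ from the environment the server itself was started with.

```python
import ipaddress
import os


def is_cloud_model(name: str) -> bool:
    """True if an Ollama model name carries the cloud marker on its tag."""
    tag = name.rsplit(":", 1)[-1] if ":" in name else ""
    # The text describes a "-cloud" suffix; treating a bare "cloud" tag the
    # same way is an extra-cautious assumption of this sketch.
    return tag == "cloud" or tag.endswith("-cloud")


def host_is_exposed(ollama_host: str) -> bool:
    """True if an OLLAMA_HOST value would listen beyond this machine."""
    value = ollama_host.strip()
    if not value:
        return False  # unset: the default binds to this machine alone
    # Drop an optional scheme and path, e.g. "http://0.0.0.0:11434/".
    value = value.split("://", 1)[-1].split("/", 1)[0]
    if value.startswith("["):
        host = value[1:].split("]", 1)[0]  # bracketed IPv6 such as [::1]:11434
    elif value.count(":") == 1:
        host = value.split(":", 1)[0]      # host:port
    else:
        host = value                        # bare host or bare IPv6
    if host == "localhost":
        return False
    try:
        return not ipaddress.ip_address(host).is_loopback
    except ValueError:
        return True  # a name we cannot classify: assume it is reachable


if __name__ == "__main__":
    print("exposed:", host_is_exposed(os.environ.get("OLLAMA_HOST", "")))
    # The second name is a made-up example of the cloud tag pattern.
    for name in ("llama3.2:3b", "example-model:120b-cloud"):
        print(name, "-> cloud" if is_cloud_model(name) else "-> local")
```

Neither check replaces watching the network traffic, but both catch the two mistakes that are easiest to make without noticing.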

3.3 Sovereignty is authority, not access

Underneath privacy sits the idea that gives this whole unit its spine, and it is worth stating precisely because the word "open" is doing double duty across this guide. In the software world, "open" means access — you can get it, use it, reuse it. But there is an older, harder claim that openness can run against. Indigenous data governance frames it through the CARE Principles — Collective benefit, Authority to control, Responsibility, Ethics — which arose specifically in response to "open data" movements that prioritized access and reuse without regard for who holds the authority to decide. The same word, "open," pulling in opposite directions: open as in anyone-may-take, versus sovereign as in we-decide.

Running a model locally is where that abstraction becomes a tool. It is not a coincidence that the Collaboratory for Indigenous Data Governance ran a 2026 session titled "AI on Your Terms: Running Local Models for Data Sovereignty" — local models are how a community can use AI without surrendering the authority over its own data that the cloud arrangement quietly takes. And here is the part that matters for everyone, not only for sovereign nations: sovereignty is a property of where the model runs, not of how much you spent. A small open model on a four-hundred-dollar mini-PC keeps your data inside your walls exactly as completely as a frontier model on a ten-thousand-dollar workstation. The expensive thing is capability at the very top of the frontier. Authority over your own data is available at the bottom of the market — which is the reason this guide opens the hardware chapter at the cheap floor and climbs, rather than the other way around.

Practice: find your line

Name one kind of data you work with that you would not be comfortable sending to a third-party server — and one kind you would. The line between them is your real reason for caring about local inference; everyone's sits in a different place. Then, for the data on the wrong side of that line, ask the three questions of any tool you would use to handle it: what its license permits, whether it phones home, and whether "local" really means local. If you cannot answer all three from the tool's own documentation, that is the tool to be most careful with.

Chapter Four

The hardware question


Reading: Manufacturer spec pages (NVIDIA DGX Spark, Apple Mac Studio) and community benchmarks (the r/LocalLLaMA hardware threads) for the live numbers; Epoch AI on frontier capability reaching consumer hardware. Verify any spec against the manufacturer's own page before you buy.

4.1 The one model that doesn't rot

The specific cards will be obsolete by the time you read this; the way to think about them will not. Two numbers decide a local-LLM machine. The first is memory capacity — how much fast memory it has, whether that's a graphics card's VRAM or a Mac's unified memory — because capacity decides what fits: if the model plus its context cache won't fit in fast memory, it won't run there at all. The second is memory bandwidth — how fast that memory can be read — because bandwidth decides how fast the model feels, since producing each token means streaming the whole active model through the processor once. Capacity sets the ceiling on size; bandwidth sets the speed within it.
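The bandwidth half of that model gives a quick ceiling on speed: if each token requires reading the active weights once, tokens per second cannot exceed bandwidth divided by the size of those weights. The sketch below applies that bound. The bandwidth figures are placeholders rather than the spec of any real machine, and the bound ignores cache reads, compute limits, and software overhead, so real throughput will be lower.

```python
def max_tokens_per_second(
    bandwidth_gb_s: float,
    active_params_billions: float,
    bits_per_weight: float = 4,
) -> float:
    """Upper bound on generation speed when memory bandwidth is the limit."""
    gb_read_per_token = active_params_billions * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # Placeholder bandwidths in GB/s; look up the real figure for your hardware.
    for bandwidth in (100, 300, 800):
        dense_70b = max_tokens_per_second(bandwidth, 70)
        dense_8b = max_tokens_per_second(bandwidth, 8)
        print(f"{bandwidth} GB/s: 70B <= {dense_70b:.1f} tok/s, 8B <= {dense_8b:.1f} tok/s")
```

For a mixture-of-experts model, pass the active parameter count rather than the total. That is the arithmetic behind a very large model still feeling quick once it is loaded.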

And there is one cliff to respect. When a model spills out of fast memory into ordinary system RAM, throughput does not degrade gently — it collapses, from a brisk conversation to something closer to reading a telegram through a straw. Fitting the model in fast memory is not a nice-to-have; it is the line between usable and useless.

4.2 Climb from the floor

Because capacity is the gate, the honest way to choose is to start at the bottom and climb only as far as your models actually require. The rungs, cheapest first:

| Tier | Fast memory | Runs | Rough cost | The catch |
|---|---|---|---|---|
| Use what you own | 8–16 GB laptop or desktop | a 7–8B model, CPU-only | $0 | slow-ish but real; 8 GB is the practical floor |
| Cheapest dedicated | Pi 5, a used 12 GB card, or a mini-PC | 1–3B on a Pi, up to 7–8B on the card | ~$60–650 | small models only; a used 12 GB card runs an 8B near reading speed |
| Build-your-own Linux rig | 2× used 24 GB cards = 48 GB | a 70B at 4-bit | ~$1,200 | you assemble and maintain it; Linux; speed needs the right serving software |
| Single consumer GPU | one 32 GB card | up to ~34B | ~$2–3k | a 70B won't fit; fast within its ceiling |
| Apple unified memory | 24 GB to 512 GB | 7B up to the largest single-box models | ~$1.4k–10k | no CUDA stack; the bandwidth leader at the top |
| NVIDIA DGX Spark | 128 GB unified | a 70B at 4-bit | ~$4k | CUDA-native; strong compute but modest bandwidth |
| Don't buy | | | by the hour | occasional use: rent, don't own |

The most important rung is the bottom one, and it is the reason this guide climbs rather than descends. A small open model on a Raspberry Pi keeps your data inside your walls exactly as completely as a frontier model on a workstation does. The expensive thing is capability at the very top; authority over your own data is available for the price of a single board.

For everyone between those extremes, the best capability-per-dollar is not any polished prosumer box — it is a self-built Linux machine with two used 24 GB cards, around $1,200 for 48 GB of memory, enough to run a 70B model at four-bit. The price of that value is your own labor: you assemble it, you run Linux, you manage power and heat, and raw capacity scales cheaply while speed needs proper serving software — which is the next chapter.

Practice: size your shortlist to a tier

Take the two or three models you would actually want to run and estimate each one's four-bit size (roughly half the parameter count in billions, as gigabytes). Then find the cheapest tier in the table above whose fast memory holds your largest one. If nothing affordable holds it, you have learned something useful before spending a dollar: either a smaller model does your job, or your job genuinely needs the bigger machine. Decide which on purpose, not by accident at checkout.
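The exercise reduces to a small lookup. The sketch below encodes only the three table rows that quote a single memory figure, using the low end of the cost range where the table gives one, and picks the cheapest row that holds a model's four-bit size. The 20 percent headroom for the context cache is an assumption; raise it if you plan on long contexts.

```python
# (tier name, fast memory in GB, rough cost in USD), taken from the table above.
TIERS = [
    ("Build-your-own Linux rig", 48, 1200),
    ("Single consumer GPU", 32, 2000),
    ("NVIDIA DGX Spark", 128, 4000),
]


def four_bit_gb(params_billions: float) -> float:
    """Rule of thumb from the text: half the parameter count, in gigabytes."""
    return params_billions / 2


def cheapest_tier(params_billions: float, headroom: float = 1.2, tiers=TIERS):
    """Cheapest tier whose fast memory holds the model plus headroom, or None."""
    needed = four_bit_gb(params_billions) * headroom
    fits = [tier for tier in tiers if tier[1] >= needed]
    return min(fits, key=lambda tier: tier[2]) if fits else None


if __name__ == "__main__":
    for size in (34, 70, 200, 744):
        tier = cheapest_tier(size)
        label = tier[0] if tier else "nothing in this list"
        print(f"{size}B at 4-bit (~{four_bit_gb(size):.0f} GB): {label}")
```

A result of None is the useful outcome the exercise describes: you have learned that the shortlist needs either a smaller model or a bigger budget before spending anything.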

Chapter Five

Standing one up


Reading: The documentation for Ollama, LM Studio, llama.cpp, Open WebUI, and vLLM.

5.1 One engine, several faces

The tools trip up newcomers because they overlap, but they stack in a clear order. At the base sits llama.cpp, the C++ engine that actually runs quantized models on CPUs and consumer GPUs — almost everything else is a friendlier face over it. Ollama wraps that engine in a single command-line tool with a local server and an API that speaks the same protocol as the big cloud providers, which makes it the default for anyone comfortable in a terminal or wiring a model into an application. LM Studio is the graphical equivalent: a desktop app with a model browser, a chat window, and the same local server behind a button, friendlier if you would rather not touch a command line. Open WebUI puts a polished browser chat interface in front of either. And when you need to serve a model to several people or a production application at real throughput, vLLM is the serving engine built for that job, with the batching and memory techniques the lighter tools leave out.

5.2 The first run

The shortest path from nothing to a running model is two commands. With Ollama, you install it, then pull and run a small model in one step:

# install Ollama on macOS or Linux, then pull and run a small model
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.2:3b

# it now serves an API at localhost:11434; its OpenAI-compatible routes sit
# under /v1, so point any compatible client there and change the model name

From nothing to a local model.

Then do the one test that proves it is yours: disconnect from the network and ask it something. If it answers, the model is running on your machine and nothing is leaving it. While you are there, apply the habits from Chapter 3 — confirm the tool's telemetry is off, make sure the model name has no -cloud suffix, and don't expose the local server to your network unless you mean to. The model that fits your hardware, runs offline, and keeps its port to itself is the one you actually control.
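The same offline test can be run from a script. This is a minimal sketch that posts one prompt to Ollama's native generate endpoint using only the standard library. The model name is an example; substitute whichever one you pulled. Because the request goes to localhost, it should still succeed with the network disconnected.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the model's reply text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    try:
        # Use whichever model name you pulled; this one is only an example.
        print(ask("In one sentence, what is a model weight?", model="llama3.2:3b"))
    except OSError as err:  # connection refused, timeout, and similar
        print(f"No local server answered: {err}")
```

If the script prints an answer with the wifi off, the model is on your machine. If it reports that no server answered, check that the runner is started before suspecting the model.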

Lab: stand one up and pull the plug

Install a runner — Ollama if you like a terminal, LM Studio if you prefer a window — and pull a model sized to the tier you landed on in Chapter 4. Have a short conversation. Then turn off your wifi and have another one. That second conversation, happening with no internet at all, is the whole point of this unit made concrete: a capable model answering your questions with nothing crossing the boundary of your machine.

Chapter Six

When local wins


Reading: Synthesis of Chapters 1–5; total-cost-of-ownership analyses, treated as heuristics rather than fixed thresholds; and the companion guide Who the Agent Rewards on the jagged frontier.

6.1 The honest economics

For all the sovereignty argument, most people first ask whether local is cheaper, and the honest answer is: it depends on volume, and on whether you count your own time. As a rough rule, below something like a billion tokens a day a hosted API usually wins once you price the hours you would spend running and maintaining your own machine. Self-hosting earns its keep in two situations: at genuine scale, where the per-token bill dwarfs the hardware, or under a sovereignty or compliance requirement, where the data simply cannot leave and the question of price never comes up. The decision is not "local is cheaper" or "the cloud is cheaper" — it is a fork with two honest branches.

6.2 The showcase, fairly

So here is the claim, stated as plainly and as honestly as the evidence allows. Open models now trail the closed frontier by about a season, run at roughly a sixth to a quarter of the per-token cost, and, once self-hosted, keep your data entirely in your hands. That is not "open beats the best closed model at everything." On the hardest tasks, and on the contamination-resistant benchmarks that test them, the gap is real and sometimes wide, and which side of the line a given task falls on is not always obvious from the outside. That is the jagged frontier that the companion guide, Who the Agent Rewards, is entirely about. The honest pitch is the one worth repeating: frontier-quality work, a short step behind, on hardware you own, at a fraction of the price, with your data never leaving the building. For a great deal of real work, that trade is not just acceptable; it is the better deal.

6.3 Own it, don't rent it

This is the first of three units, and it was the foundation: you can now read a license, understand what a model needs and how it behaves, run one on your own hardware, and judge when doing so is the right call. The arc continues from here — own the model, then own the tools, then own the stack. Unit two puts the model you just stood up to work inside an agentic coding loop, comparing the open command-line tools that drive it. Unit three reaches past text into images, audio, and the rest of the open ecosystem, on the same hardware. The thread running through all three is the one this unit began with: own your tools rather than rent them, one floor at a time, and keep the authority over your own work where it belongs — with you.

Practice: make the call

Take one workload you actually have and walk it through the figure. Is its data sovereign or regulated? If so, you are done — it stays local, and cost never entered the picture. If not, is it high-volume or always-on enough that the per-token bill would outrun a one-time machine and your time to run it? Write down the answer and the one fact that decided it. That single sentence is the difference between choosing local on purpose and adopting it as a fashion — and choosing on purpose is the entire point.

Reference

Glossary


Open-weight model
A model whose trained weights you can download and run under some license — but not necessarily with the training code or data. Most "open source" models are really this.
Open-source AI
The strict bar (per the OSI): weights plus training code plus enough data information to recreate a substantially equivalent model. Rare; OLMo is the clearest example.
Weights / parameters
The trained numbers that make a model work. Their count, in billions, is the model's headline size.
Mixture of experts (MoE)
A design that quotes total and active parameters; a router uses only a subset per token, so total governs memory-to-fit while active governs speed.
Quantization
Compressing each weight to fewer bits (e.g. 4-bit Q4_K_M) to shrink a model onto affordable hardware, at a measurable cost in quality. File size on disk ≈ memory to run it.
Context window
How much text a model can read at once. Larger windows cost memory through a growing KV cache that competes with the model for room.
Temperature
The dial on randomness — low for code and facts, higher for creative work — alongside top-p and min-p. Match the model card's defaults and chat template.
Telemetry
Usage data a tool may send home. "Local" inference can still sit inside an app that phones home; verify per tool rather than assume.
Data sovereignty
Authority over one's own data — who decides, not just who can access. The CARE Principles frame it against "open data" thinking; local models are an instrument of it.
The capability gap
The lag between the best open and best closed models — about four months on a common index since early 2026, wider on harder private benchmarks.
Capacity and bandwidth
The two specs that decide a local machine: fast-memory capacity sets what model fits; memory bandwidth sets how fast it generates. Spilling out of fast memory collapses speed.
Fast memory (VRAM / unified memory)
The memory that holds a running model — a GPU's VRAM or a Mac's unified memory. The model plus its context cache must fit here, or it runs far slower or not at all.
Local runner
Software that loads and serves a model on your own hardware — llama.cpp underneath, Ollama or LM Studio to drive it, Open WebUI as a front end, vLLM to serve many at once.
Sources

Open Source Initiative, The Open Source AI Definition · G7, Vision on AI openness: opportunities and shared language (June 2026) · Epoch AI, Open models lag state-of-the-art closed models by 4 months (CC BY) · Allen Institute for AI, OLMo 3 · Meta, Llama 4 Community License · Z.ai, GLM-5.2 pricing; Anthropic, Claude pricing · Global Indigenous Data Alliance, CARE Principles for Indigenous Data Governance · privacy and offline documentation for Ollama and LM Studio; the llama.cpp and vLLM projects · NVIDIA DGX Spark and Apple Mac Studio specifications, with community hardware benchmarks.

An unofficial study text. Commentary synthesized in original wording; every figure is drawn from the cited primary research and was verified against it. Don't cite Urania or this guide — cite the primary sources above. Researched and drafted by Urania, an AI research system; edited, verified, and signed by Zach Rossmiller, who is accountable for what's published.