Beyond Text
June 2026

The same open-weights idea reaches past text into images, speech, and the meaning layer behind search, and here open has done more than narrow the gap: in nearly every modality it has caught the closed frontier. The finale of the local-models cluster: own the whole stack.

This is the last of three units on running open models on your own hardware: own the model, own the tools, own the stack. The first two units ran a model for chat and for code; this one reaches into the other modalities — images, audio, and the embeddings behind local search. It draws on the labs' own model cards and license files, the Hugging Face Open ASR Leaderboard, the MTEB embedding leaderboard, the Black Forest Labs and Alibaba release notes, and public preference arenas. The media space moves even faster than text, so every model, score, and license here is dated to June 2026 — and several of these licenses are the trap this unit is partly about, so read the actual license file before you ship.

In Plain Terms

The first two units put a model to work on words — answering questions, then writing code — all on your own machine. This unit finishes the picture by reaching into everything else a modern AI stack does: making images, turning speech into text and text into speech, and building the quiet "meaning layer" that lets a machine search your own documents by what they mean rather than the words they contain.

The good news arrives early and it is genuinely surprising. In these modalities, the open models you can download and run are not the runners-up. For making images, for transcribing and synthesizing speech, and for the embeddings behind search, open models now stand level with — and sometimes ahead of — the best closed services, while running lighter than the coding models from the last unit. A frontier-quality text-to-speech voice fits on a laptop CPU. The best open embedding model tops the multilingual leaderboard. This is the part of the stack where the case for owning your tools is strongest, because you give up the least to make it.

Two honest complications keep this from being a sales pitch. The first is licensing, and it bites harder here than anywhere in the cluster: media weights are far more likely than language models to carry a non-commercial license, where you can download and play but cannot ship. The same brand can hide both a freely-usable model and a restricted one under nearly identical names — so the habit from Unit one, read the license not the marketing, becomes the difference between a product and a lawsuit. The second is video, which is the one place open still trails: you can generate good short clips locally, but the polished, long, cinematic output still belongs to the closed services, for now.

The center of this unit, though, is quieter than image or audio, and it is the real reason the cluster ends here. Embeddings let you build a private search-and-answer system over your own documents — your notes, your files, your archive — where nothing is ever sent to anyone's server. That is data sovereignty made literal: not a promise in a privacy policy, but a machine on your desk that simply never phones home. Own the model, own the tools, own the stack — and the stack, completed, is yours.

Chapter One

The whole stack


Reading: Unit one, Run Your Own Model, and Unit two, Open Models That Code, for the model-plus-runner pattern and the license spectrum this unit extends.

1.1 Three more modalities, one familiar shape

By now the pattern is muscle memory: an open model is weights you download, and a runner is the program that serves it on hardware you own. Units one and two applied that to text — a chat model behind Ollama, a coding model behind an agent. This unit applies the identical shape to three more kinds of work. Image generation: a diffusion model that turns a prompt into a picture. Audio: models that turn speech into text and text into speech. Embeddings: models that turn text into vectors of meaning, the engine under semantic search and retrieval. Different models, different runners, but the same arrangement every time — a model, a runner, your machine — which is why, having built the first two floors, you already know how to build these.

1.2 The surprise, and the two catches

There is a reversal worth stating before the details. In text and code, the honest line from Units one and two was that open trails the closed frontier by a handful of months. In the modalities of this unit, that gap mostly disappears: open image, audio, and embedding models stand level with or ahead of the best closed offerings, and they run on lighter hardware than the trillion-parameter coding models. The case for local is strongest exactly here. The two catches that keep it honest — a license landscape that skews non-commercial, and video as the lone modality where closed still holds a real lead — are the subjects of §2.2 and Chapter 5. Everything between is the payoff.

Check your understanding

Name the three modalities this unit adds and, for each, the one sentence of the model-plus-runner pattern you already know from Units one and two. If the pattern feels obvious, good — that familiarity is the point, and it means the only genuinely new things to learn here are which models are worth running and which licenses to distrust.

Chapter Two

Images


Reading: Black Forest Labs, the FLUX.2 model cards and license files and the FLUX.2 repo; the Qwen-Image card and license; Stability AI's Community License for SD 3.5; and Unit one's license spectrum.

2.1 The roster

Two open lineages anchor image generation in mid-2026, with a third holding the largest ecosystem. FLUX.2 (Black Forest Labs) is the quality leader, a flow-matching family spanning a 32B [dev] model and small, fast [klein] variants. Qwen-Image (Alibaba) is the clean permissive alternative: a 20B model that is especially strong at rendering text inside images and at multilingual prompts. Stable Diffusion 3.5 (Stability AI) no longer leads on raw quality but retains the deepest ecosystem of fine-tunes, LoRAs, and control tools. All three run locally on a single consumer GPU. FLUX.2 [klein] 4B and Qwen-Image both fit in about 13 GB, and even the heavy FLUX.2 [dev], which wants more than 80 GB at full precision, drops to 18–24 GB quantized and will run on an 8 GB card when there is enough system memory to hold what the card cannot. This is one of the lightest hardware stories in the cluster.
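The memory figures above follow from simple arithmetic on parameter counts, and it can help to see that arithmetic spelled out. The sketch below estimates weight size only, assuming "full precision" means 16-bit weights and that 8-bit and 4-bit quantization store one byte and half a byte per parameter. The precision labels and the `weight_gb` helper are illustrative, not taken from any model card, and real usage runs higher because the text encoder, VAE, and activations also need memory.

```python
# Weight-only memory estimate: parameters x bytes per parameter.
# Assumption: ignores the text encoder, VAE, activations, and runtime
# overhead, so real memory use is higher than these numbers.

BYTES_PER_PARAM = {"16-bit": 2.0, "8-bit": 1.0, "4-bit": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight size in GB (billions of params x bytes per param)."""
    return params_billion * BYTES_PER_PARAM[precision]


# Parameter counts as stated in the text above.
for name, size_b in [("FLUX.2 [dev]", 32), ("Qwen-Image", 20), ("FLUX.2 [klein] 4B", 4)]:
    row = ", ".join(f"{p}: {weight_gb(size_b, p):.0f} GB" for p in BYTES_PER_PARAM)
    print(f"{name:<18} {row}")
```

On these assumptions the 32B model needs about 64 GB for 16-bit weights alone and about 16 GB at 4-bit, which is in line with the "more than 80 GB" and "18–24 GB" figures once the other components are counted.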

2.2 Read the license, not the brand

Here is where media diverges sharply from the language models of Units one and two, and where the cluster's license lesson earns its keep. The same FLUX.2 brand splits across licenses by variant: FLUX.2 [klein] 4B is Apache 2.0 — fully open, commercial use allowed — while FLUX.2 [klein] 9B and the flagship FLUX.2 [dev] are released under the FLUX Non-Commercial License, where you may download and experiment but may not ship commercially without a separate agreement, and the top-end [pro] tier is closed, API-only, not downloadable at all. Qwen-Image, by contrast, is cleanly Apache 2.0 throughout, and SD 3.5 sits between the two under Stability's Community License, which is free until your organization crosses a revenue threshold (around one million dollars a year) and then requires a paid license. Three flagships, three different answers to "can I ship this?" — and the answer lives only in the license file, never in the announcement.
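The three answers to "can I ship this?" can be written down as a small lookup, which makes the variant-by-variant split easier to see. This is a sketch that only transcribes the terms described in this section as of June 2026; the `LicenseTerms` class and `can_ship_without_extra_license` function are hypothetical names, the revenue cap is the approximate figure quoted above, and none of it substitutes for reading the license file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LicenseTerms:
    license_name: str
    downloadable: bool
    commercial_ok: bool
    # Commercial use is free only below this annual revenue (approximate).
    revenue_cap_usd: Optional[int] = None


# Transcribed from the paragraph above; verify against the license files.
TERMS = {
    "FLUX.2 [klein] 4B": LicenseTerms("Apache 2.0", True, True),
    "FLUX.2 [klein] 9B": LicenseTerms("FLUX Non-Commercial License", True, False),
    "FLUX.2 [dev]": LicenseTerms("FLUX Non-Commercial License", True, False),
    "FLUX.2 [pro]": LicenseTerms("closed, API-only", False, False),
    "Qwen-Image": LicenseTerms("Apache 2.0", True, True),
    "SD 3.5": LicenseTerms("Stability Community License", True, True, 1_000_000),
}


def can_ship_without_extra_license(model: str, annual_revenue_usd: int) -> bool:
    """True if the weights may be shipped commercially with no further agreement."""
    terms = TERMS[model]
    if not (terms.downloadable and terms.commercial_ok):
        return False
    if terms.revenue_cap_usd is not None and annual_revenue_usd >= terms.revenue_cap_usd:
        return False
    return True


for model in TERMS:
    print(f"{model:<18} {can_ship_without_extra_license(model, 250_000)}")
```

The design point is that the key is the exact variant, not the brand: two entries that share the FLUX.2 [klein] name return different answers.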

2.3 Where open stands

On quality, the practitioner consensus in mid-2026 is that open image generation has effectively reached the closed frontier for local and self-hosted work. FLUX.2 and Qwen-Image are treated as daily drivers — Qwen tends to win on realism and text rendering, FLUX on prompt adherence and fine control — and both are routinely judged competitive with the closed image services for most production needs. Closed tools may keep an edge in particular aesthetic styles or sheer polish, but they no longer hold a categorical lead, and they cannot match the open ecosystem's control surface: the fine-tunes, the LoRAs, the node-based workflows in ComfyUI, the ability to run a million images at zero marginal cost on your own card. For images, owning the tool costs you almost nothing and buys you control you cannot rent.

Practice

Open the actual license file for one model you'd want to use — not the model card blurb, the license. For a FLUX.2 variant, confirm whether it's the Apache 4B or a non-commercial one; for Qwen-Image or SD 3.5, find the clause about commercial use and revenue. Write one sentence: "I can / cannot ship the output of this model in a paid product, because the license says ___." That sentence is the entire skill of this chapter.

Chapter Three

Audio


Reading: The Hugging Face Open ASR Leaderboard and its companion paper; the Whisper large-v3 and Kokoro-82M model cards; NVIDIA's NeMo Canary and Parakeet cards.

3.1 Speech in: transcription

Turning speech into text is the most settled corner of this unit. OpenAI's Whisper (large-v3) remains the multilingual workhorse — 99 languages, permissively licensed, and runnable on a CPU through whisper.cpp or fast on a GPU through faster-whisper. For English specifically, NVIDIA's models have taken the lead: Canary-Qwen 2.5B tops the Open ASR Leaderboard at a 5.63% word error rate, and the Parakeet family is the throughput champion, transcribing far faster than real time. Of the eighty-some models on that leaderboard in early 2026, the large majority are open. The honest qualifier: closed services such as ElevenLabs Scribe still lead on the hardest long-form and multilingual transcription, so the open win is clearest for English and for short-form audio. For most local transcription needs, an open model is not a compromise — it is the better choice, with no per-minute bill and nothing leaving the machine.
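Word error rate is the number behind that 5.63% figure, so it is worth seeing how it is computed. The sketch below uses the standard definition (substitutions plus deletions plus insertions, divided by the number of reference words) via a word-level edit distance. Lowercasing and whitespace splitting are simplifying assumptions here; leaderboards generally apply a fuller text normalization before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A score of 5.63% therefore means roughly five or six word-level mistakes per hundred reference words, which is a useful yardstick when timing and checking a transcript of your own audio.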

3.2 Speech out: synthesis

Text-to-speech is the modality that most surprises people, because the standout model is tiny. Kokoro-82M is an 82-million-parameter model, a few hundred megabytes on disk, released under Apache 2.0. It runs comfortably on a CPU and reached the top of the TTS preference arena in January 2026, beating models a hundred times its size. A frontier-class voice that runs on any laptop with no GPU and no recurring cost is the single clearest equity story in this cluster. Above it sit more expressive options for when you need voice cloning or emotional range: Chatterbox (Resemble AI, MIT-licensed) won a clear majority preference in blind tests against ElevenLabs, the leading closed service. Closed TTS keeps an edge in the most natural prosody and zero-shot cloning polish, but for narration, prototyping, accessibility, and any cost-sensitive use, open is excellent and free.

Practice

Transcribe one audio file locally — a voice memo, a meeting recording — with a local Whisper runner, fully offline. Note two things: that it ran with no network connection, and roughly how fast it went on your hardware. Then ask what you currently pay, in money or in data exposure, to send that same audio to a cloud transcription service. That difference is the case for local audio in one experiment.
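One way to run this exercise is sketched below with the faster-whisper package, one of the runners named in §3.1. It assumes the package is installed and the model weights are already on disk, since the first load otherwise downloads them. The function names `transcribe_offline` and `speed_factor`, the file name in the usage comment, and the choice of int8 on CPU are illustrative choices, not requirements.

```python
import time


def transcribe_offline(audio_path: str, model_name: str = "large-v3"):
    """Transcribe one file on CPU; return (text, audio_seconds, wall_seconds)."""
    # Imported lazily so speed_factor() works without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_name, device="cpu", compute_type="int8")
    started = time.perf_counter()
    segments, info = model.transcribe(audio_path)
    # `segments` is a generator: decoding happens while it is consumed.
    text = " ".join(segment.text.strip() for segment in segments)
    wall_seconds = time.perf_counter() - started
    return text, info.duration, wall_seconds


def speed_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# Usage (hypothetical file name), with the network switched off:
#   text, audio_s, wall_s = transcribe_offline("voice_memo.wav")
#   print(text)
#   print(f"{speed_factor(audio_s, wall_s):.1f}x real time")
```

A speed factor above 1 means the machine transcribes faster than the recording plays. A smaller model size trades some accuracy for speed if `large-v3` is too slow on your CPU.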

Chapter Four

The knowledge layer


Reading: The MTEB embedding leaderboard and the Qwen3-Embedding repository and cards; BAAI's BGE-M3 and Nomic's embedding cards; and the local vector-store documentation (Chroma, Qdrant, FAISS).

4.1 What embeddings are, and why they're the point

An embedding model turns a piece of text into a vector — a long list of numbers — positioned so that texts with similar meaning land near each other. That single trick is the engine under semantic search and under retrieval-augmented generation (RAG): the pattern where, to answer a question about your documents, a system finds the passages whose embeddings sit closest to the question's and hands those to a language model. Embeddings are the least glamorous modality in this unit and the most important to it, because a local embedding model plus a local vector store plus the Unit-one language model is a complete, private question-answering system over your own files — and nothing in that loop ever leaves your machine.
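The retrieval step of that loop is small enough to show in full. The sketch below turns texts into vectors, scores them against a query by cosine similarity, and returns the nearest passages. The `toy_embed` function is a deliberate stand-in: it counts words against a vocabulary, so it matches shared words rather than meaning. A real embedding model would replace it and nothing else in the loop would change. All function names and the sample passages are hypothetical.

```python
import math
from collections import Counter


def build_vocab(texts):
    """Map every word in the corpus to a vector position."""
    words = sorted({w for t in texts for w in t.lower().split()})
    return {w: i for i, w in enumerate(words)}


def toy_embed(text, vocab):
    """Stand-in embedding: word counts, NOT a semantic model."""
    counts = Counter(w for w in text.lower().split() if w in vocab)
    vec = [0.0] * len(vocab)
    for word, count in counts.items():
        vec[vocab[word]] = float(count)
    return vec


def cosine(a, b):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, passages, k=2):
    """Return the k (score, passage) pairs nearest to the query."""
    vocab = build_vocab(passages)
    q = toy_embed(query, vocab)
    scored = [(cosine(q, toy_embed(p, vocab)), p) for p in passages]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


passages = [
    "notes on the flux license terms",
    "whisper transcription speed on my laptop",
    "sourdough starter feeding schedule",
]
for score, passage in top_k("flux license", passages):
    print(f"{score:.3f}  {passage}")
```

In a full RAG system the passages returned here are what gets handed to the language model alongside the question, and a vector store does the nearest-neighbor search instead of the linear scan.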

4.2 Open owns this modality

If there is one place open has flatly won, it is embeddings. Qwen3-Embedding-8B sits at the top of the MTEB multilingual leaderboard with a score of 70.58, ahead of the closed embedding APIs from OpenAI and Google on multilingual retrieval. It comes in 0.6B, 4B, and 8B sizes, and the smaller sizes run near-instantly on modest hardware. BGE-M3 (BAAI, MIT-licensed) is the most-downloaded open embedder and pairs dense and sparse retrieval in one model; nomic-embed is a tiny, permissive workhorse. The one honest qualifier is narrow: on English-only retrieval, Google's closed Gemini embedding still holds a slim lead. Everywhere else, and especially across languages, the open models are at or beyond the frontier, and the economics are not close. At a steady hundred million tokens a day, a closed embedding API runs on the order of thirteen thousand dollars a month; the same work self-hosted is a few hundred. (All figures as of June 2026; verify before quoting.)
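Since the paragraph asks you to verify the cost figures before quoting them, the arithmetic is worth making explicit. The sketch below converts a daily token volume into a monthly bill and, run in reverse, shows what per-token rate a given monthly bill implies. No provider's price is assumed: the only inputs are the hundred million tokens a day and the thirteen thousand dollars a month stated above, plus a 30-day month. The function names are hypothetical.

```python
def monthly_api_cost(tokens_per_day: float, usd_per_million_tokens: float,
                     days: int = 30) -> float:
    """Monthly bill for a metered embedding API at a flat per-token rate."""
    millions_per_month = tokens_per_day * days / 1e6
    return millions_per_month * usd_per_million_tokens


def implied_usd_per_million(monthly_usd: float, tokens_per_day: float,
                            days: int = 30) -> float:
    """Per-million-token rate that a given monthly bill corresponds to."""
    millions_per_month = tokens_per_day * days / 1e6
    return monthly_usd / millions_per_month


tokens_per_day = 100_000_000  # volume stated in the text
rate = implied_usd_per_million(13_000, tokens_per_day)
print(f"Tokens per month: {tokens_per_day * 30 / 1e6:,.0f} million")
print(f"Implied rate: ${rate:.2f} per million tokens")
```

At this volume the monthly bill in dollars is simply 3,000 times the per-million-token rate, so the quoted figure corresponds to a bit over four dollars per million tokens. Comparing that against the rate on a provider's current price list is the quickest check before reusing the number.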

4.3 Own your knowledge

This is where the cluster's thread ties off. A private RAG system over your own corpus is the same idea, scaled to a person or a team, as keeping your knowledge in portable files you control rather than inside a vendor's catalog. The model embeds; the vector store holds; the language model answers; and because every piece is open and local, the corpus — which may be the most sensitive thing you own — is never uploaded, never trained on, never subject to anyone's terms changing under you. The earlier units kept your conversations and your code on the machine. This one keeps your knowledge there too, which is the form of sovereignty that matters most and the reason the whole cluster has been building toward this chapter.

Practice

Embed a folder of your own notes with a local model and run one similarity query against it — "what did I write about X" — entirely offline. Watch it return the right passages without a single network call. You have just built, in miniature, the private knowledge layer this chapter is about; scaling it up is engineering, but the sovereignty is already complete at this size.
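A minimal version of this exercise might look like the sketch below. It assumes the sentence-transformers package is installed and that the model weights are already cached locally so the run stays offline. The model id is the smallest Qwen3-Embedding size as published on Hugging Face, used here as an example; any local embedding model the package supports could be swapped in. The function names and the paragraph-based chunking are illustrative choices.

```python
from pathlib import Path


def load_passages(folder, max_chars=800):
    """Split every .txt/.md file under `folder` into (path, passage) pairs."""
    passages = []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in {".txt", ".md"}:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for block in text.split("\n\n"):  # one passage per paragraph
            block = block.strip()
            # Long paragraphs are cut into fixed-size slices.
            for start in range(0, len(block), max_chars):
                passages.append((str(path), block[start:start + max_chars]))
    return passages


def search_notes(folder, query, k=3, model_name="Qwen/Qwen3-Embedding-0.6B"):
    """Return the k passages most similar to the query as (score, path, text)."""
    # Imported lazily so load_passages() works without the package installed.
    from sentence_transformers import SentenceTransformer

    passages = load_passages(folder)
    if not passages:
        return []
    model = SentenceTransformer(model_name)
    doc_vecs = model.encode([text for _, text in passages], normalize_embeddings=True)
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ query_vec  # cosine similarity, since vectors are unit length
    best = scores.argsort()[::-1][:k]
    return [(float(scores[i]), passages[i][0], passages[i][1]) for i in best]


# Usage (hypothetical folder and query):
#   for score, path, text in search_notes("notes", "what did I write about licensing"):
#       print(f"{score:.3f}  {path}\n    {text[:80]}")
```

This keeps the vectors in memory, which is enough for a folder of notes. A vector store such as Chroma, Qdrant, or FAISS becomes worthwhile once the index is too large to rebuild on every query. Some embedding models also expect a query-specific prompt or prefix, so check the model card for the one you choose.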

Chapter Five

The honest gap


Reading: The open-video model cards and licenses (Alibaba's Wan, Tencent's HunyuanVideo, Lightricks' LTX-Video) and a public text-to-video preference arena; and the companion guide Who the Agent Rewards on verification and the jagged frontier.

5.1 Video: where open still trails, and by how much

Every other modality in this unit reached the frontier; video is where open still trails — though by less than its reputation suggests. The open flagship is Alibaba's Wan family, released under Apache 2.0: Wan 2.2 runs its small 1.3B variant on an 8 GB card, and the April 2026 Wan 2.7 debuted at or near the top of public text-to-video preference leaderboards. So open video is genuinely good now. What closed still holds is the premium ceiling — the top services (Sora 2, Veo 3.1, Kling) lead on native 4K, long and continuous shots, physical realism in motion, and precise lip-sync. The honest 2026 practice is tiered routing: generate the bulk of clips with an open model locally at zero marginal cost, and reach for a closed service only for the hero shot that needs the ceiling. Video is also the heaviest modality here — the larger Wan variants want multiple GPUs — so it strains the rig in a way image and audio do not.

5.2 The media-license skew, and softer benchmarks

Two cautions carry across all of this unit's modalities. The first is the license skew already met in Chapter 2: media weights trend non-commercial far more than language models do — FLUX.2 [dev], many video models, some music models — so "open" in media demands the license read even more than in text, where Apache and MIT are the norm. The second is that the benchmarks are softer. Image, audio, and video quality are partly subjective, judged by preference arenas and listening tests rather than the kind of test suite that grades code — and Unit two showed that even code's hard benchmarks were inflated. The lesson transfers: trust your own eyes and ears over any leaderboard, and treat a single headline score as a starting hypothesis, not a verdict.

5.3 The reconciliation

The companion guide Who the Agent Rewards drew the line that applies here too: a model is the input, and the judgment about whether its output is actually good is supplied by you. That is why the verification habit matters more as these tools get better, and why the jagged frontier — some tasks easily within a model's reach, others of similar appearance just beyond it — is as real for a video prompt or a tricky transcription as for code. Owning the stack does not remove that judgment; it concentrates it in your hands, which is exactly where the cluster has argued it belongs.

Check your understanding

Pick the one modality in this unit where you'd still reach for a closed service, and say precisely why — the specific capability at the premium ceiling that open doesn't yet match for your work. If the honest answer is "none of them, for what I actually do," that is the finding: for most real work, the whole stack is already local.

Chapter Six

Own the stack


Reading: Synthesis of the cluster; the runner docs (Ollama, ComfyUI) and the local vector-store and TTS/ASR tooling above; and Units one and two, Run Your Own Model and Open Models That Code.

6.1 The whole rig, assembled

Stand the three units together and you have a complete, sovereign, multimodal workstation on hardware you own. The Unit-one language model answers and reasons; the Unit-two coding agent writes and fixes code; this unit adds image generation through ComfyUI, speech in and out through Whisper and Kokoro, and a private knowledge layer through an embedding model and a vector store. One rig, one set of skills — an open model behind a runner — repeated across every modality. The connective tissue is the same throughout: open weights you can keep, runners you control, and a machine boundary that nothing crosses.

6.2 The decision, per modality

When is local the right call? For this unit's modalities the answer is unusually easy. Image, audio, and embeddings: run local by default — open is at the frontier, the hardware is light, the licenses (read them) are often permissive, and the privacy and cost wins are large. Video: run open for the volume and reach for a closed service only when you need the premium ceiling. And as always, sovereignty decides first: if the data — the source images, the recorded voices, the corpus — cannot leave, the decision is made before cost or quality enters. The cluster's rule holds to the end: sovereignty is a dial, not a switch, and in these modalities it sits nearly all the way toward local.
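The order of the checks matters as much as the answers, and a few lines of code make that order explicit. The sketch below encodes the rule as stated in this section: sovereignty first, then the modality default, then the video exception. The function name, the modality labels, and the two return values are hypothetical; chat and code are left out because their decision belongs to Units one and two.

```python
LOCAL_BY_DEFAULT = {"image", "audio", "embeddings"}


def choose(modality: str, data_may_leave: bool,
           needs_premium_ceiling: bool = False) -> str:
    """Return "local" or "closed" for one workload in this unit."""
    # Sovereignty decides first: if the data cannot leave the machine,
    # cost and quality never enter the decision.
    if not data_may_leave:
        return "local"
    if modality in LOCAL_BY_DEFAULT:
        return "local"
    if modality == "video":
        # Tiered routing: open for volume, closed only for the ceiling.
        return "closed" if needs_premium_ceiling else "local"
    raise ValueError(f"modality not covered in this unit: {modality!r}")


print(choose("audio", data_may_leave=True))
print(choose("video", data_may_leave=True, needs_premium_ceiling=True))
print(choose("video", data_may_leave=False, needs_premium_ceiling=True))
```

The third call is the instructive one: a hero shot that would otherwise go to a closed service stays local because the source material cannot leave, which is what "sovereignty decides first" means in practice.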

6.3 The close

Three units, one argument. Own the model: stand up an open language model and judge when local beats the API. Own the tools: drive it with an open coding agent, your source never leaving the building. Own the stack: complete the rig with images, audio, and a private knowledge layer, every modality at or near the frontier, on hardware that is yours. The thread was never really about avoiding a bill — it was about keeping authority over your own work where it belongs: your conversations, your code, your knowledge, your tools. The open ecosystem made that possible across the whole stack, faster than almost anyone expected. What you build on it is the part that was always going to be yours.

Practice · put it together

Stand up two new modalities from this unit on your rig — say, local transcription and a small RAG index over your own notes. Then write the cluster's closing audit for your own situation: for each of the five workloads (chat, code, image, audio, knowledge), local or closed, and the one-sentence reason. If most of your answers are "local," you didn't just read the cluster — you own the stack.

Reference

Glossary


Diffusion model
The dominant architecture for open image (and video) generation: it starts from noise and refines toward an image that matches the prompt over a series of steps.
FLUX.2
Black Forest Labs' open image family — a 32B [dev] model and small fast [klein] variants — and the cluster's sharpest license lesson, since the same brand spans Apache and non-commercial terms by variant.
Non-commercial license
A license that lets you download and use a model for research or personal work but not ship it commercially without a separate agreement. Common for media weights, rare for language models.
ASR (automatic speech recognition)
Speech-to-text. Whisper is the open multilingual standard; NVIDIA's Canary and Parakeet lead on English accuracy and speed.
TTS (text-to-speech)
Speech synthesis. Kokoro-82M is the standout open model — frontier-quality, Apache-licensed, and small enough to run on a CPU.
Open ASR Leaderboard
The Hugging Face benchmark ranking transcription models by word error rate and speed; the citable primary for who actually leads, most of them open.
Embedding
A vector of numbers representing a text's meaning, positioned so similar texts sit near each other. The engine under semantic search and RAG.
RAG (retrieval-augmented generation)
Answering a question by first retrieving the most relevant passages from your own documents (via embeddings) and handing them to a language model. The local knowledge layer.
Vector store
A database built to find the nearest embeddings to a query fast — Chroma, Qdrant, FAISS. The "memory" half of a local RAG system.
MTEB
The Massive Text Embedding Benchmark, the standard leaderboard for embedding models, where open models — led by Qwen3-Embedding — sit at or above the closed APIs.
Premium ceiling
The narrow band of capability closed services still hold over open ones in these modalities — chiefly cinematic video (4K, length, physical realism) — reached for deliberately rather than by default.
Tiered routing
The practical 2026 pattern for video: generate the bulk of clips with an open model locally and escalate only the shots that need the premium ceiling to a closed service.
Sources

Black Forest Labs, the FLUX.2 model cards, license files, and inference repo · Alibaba, the Qwen-Image and Qwen3-Embedding cards and licenses, and the Wan video models · Stability AI's Community License for SD 3.5 · the Hugging Face Open ASR Leaderboard and its companion paper · the Whisper and Kokoro-82M cards · the MTEB leaderboard · and the companion guide Who the Agent Rewards.

An unofficial study text; models, scores, licenses, and prices are dated to June 2026 and move quickly — and several media licenses are restrictive, so verify the license file against the primary before you ship. Commentary synthesized in original wording; every figure is drawn from the cited primary sources and was verified against them. Don't cite Urania or this guide — cite the primary sources above. Researched and drafted by Urania, an AI research system; edited, verified, and signed by Zach Rossmiller, who is accountable for what's published.