Beyond Text

Chapter One

The whole stack

Reading: Unit one, Run Your Own Model, and Unit two, Open Models That Code, for the model-plus-runner pattern and the license spectrum this unit extends.

1.1 Three more modalities, one familiar shape

By now the pattern is muscle memory: an open model is weights you download, and a runner is the program that serves it on hardware you own. Units one and two applied that to text — a chat model behind Ollama, a coding model behind an agent. This unit applies the identical shape to three more kinds of work. Image generation: a diffusion model that turns a prompt into a picture. Audio: models that turn speech into text and text into speech. Embeddings: models that turn text into vectors of meaning, the engine under semantic search and retrieval. Different models, different runners, but the same arrangement every time — a model, a runner, your machine — which is why, having built the first two floors, you already know how to build these.

Figure 1. The completed stack. The first two units built the text and code floors; this one adds image and audio generation and the embedding layer. Every floor is the same construction — an open model behind a runner, on hardware you own — which is the whole reason the skills transfer.

1.2 The surprise, and the two catches

There is a reversal worth stating before the details. In text and code, the honest line from Units one and two was that open trails the closed frontier by a handful of months. In the modalities of this unit, that gap mostly disappears: open image, audio, and embedding models stand level with or ahead of the best closed offerings, and they run on lighter hardware than the trillion-parameter coding models. The case for local is strongest exactly here. The two catches that keep it honest — a license landscape that skews non-commercial, and video as the lone modality where closed still holds a real lead — are the subjects of §2.2 and Chapter 5. Everything between is the payoff.

Check your understanding

Name the three modalities this unit adds and, for each, restate the model-plus-runner pattern from Units one and two in one sentence. If the pattern feels obvious, good: that familiarity is the point, and it means the only genuinely new things to learn here are which models are worth running and which licenses to distrust.

Chapter Two

Images

Reading: Black Forest Labs' FLUX.2 model cards, license files, and repo; the Qwen-Image card and license; Stability AI's Community License for SD 3.5; and Unit one's license spectrum.

2.1 The roster

Two open lineages anchor image generation in mid-2026, with a third holding the largest ecosystem. FLUX.2 (Black Forest Labs) is the quality leader, a flow-matching family spanning a 32B [dev] model and small, fast [klein] variants. Qwen-Image (Alibaba) is the clean permissive alternative: a 20B model that is especially strong at rendering text inside images and at multilingual prompts. Stable Diffusion 3.5 (Stability AI) no longer leads on raw quality but retains the deepest ecosystem of fine-tunes, LoRAs, and control tools. All three run locally on a single consumer GPU. FLUX.2 [klein] 4B and Qwen-Image both fit in about 13 GB. Even the heavy FLUX.2 [dev], which wants more than 80 GB at full precision, drops to 18–24 GB quantized and will run on an 8 GB card if there is enough system memory to hold what does not fit on the GPU. That is a far lighter hardware story than the trillion-parameter coding models of Unit two.
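The memory figures in this section follow, to a first approximation, from parameter count times bytes per parameter. Here is a minimal sketch of that arithmetic for the weights alone. The function name and the bit widths are illustrative assumptions, not anything taken from the model cards, and real usage runs higher because text encoders, the VAE, and activations need memory too.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


# Parameter counts are the ones quoted in the text; bit widths are assumptions.
MODELS = [("FLUX.2 [dev]", 32), ("Qwen-Image", 20), ("FLUX.2 [klein] 4B", 4)]

for name, params_billion in MODELS:
    for bits in (16, 8, 4):
        size = weight_memory_gb(params_billion, bits)
        print(f"{name:18s} {bits:2d}-bit weights: {size:5.1f} GB")
```

On this estimate the 32B model's weights need 64 GB at 16 bits and 16 GB at 4 bits. The second number sits just under the quantized range quoted above, which is roughly what you would expect once the other components are counted.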

2.2 Read the license, not the brand

Here is where media diverges sharply from the language models of Units one and two, and where the cluster's license lesson earns its keep. The same FLUX.2 brand splits across licenses by variant. FLUX.2 [klein] 4B is Apache 2.0: fully open, commercial use allowed. FLUX.2 [klein] 9B and the flagship FLUX.2 [dev] are released under the FLUX Non-Commercial License, under which you may download and experiment but may not ship commercially without a separate agreement. The top-end [pro] tier is closed and API-only, not downloadable at all. Qwen-Image, by contrast, is cleanly Apache 2.0 throughout, and SD 3.5 sits between the two under Stability's Community License, which is free until your organization crosses a revenue threshold (around one million dollars a year) and then requires a paid license. Three flagships, three different answers to "can I ship this?", and the answer lives only in the license file, never in the announcement.
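The license split above can be written down as a small lookup table, which makes the "can I ship this?" question mechanical. This is a sketch that only transcribes the paragraph: the class, field, and function names are hypothetical, the revenue cap is the approximate figure quoted in the text, and none of it replaces reading the license file itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LicenseTerms:
    license_name: str
    downloadable: bool
    commercial_use: bool                      # may self-hosted weights back a paid product?
    revenue_cap_usd: Optional[float] = None   # commercial use is free only below this


# Approximate threshold from the text; the binding number is in the license file.
SD35_REVENUE_CAP_USD = 1_000_000

TERMS = {
    "FLUX.2 [klein] 4B": LicenseTerms("Apache 2.0", True, True),
    "FLUX.2 [klein] 9B": LicenseTerms("FLUX Non-Commercial License", True, False),
    "FLUX.2 [dev]": LicenseTerms("FLUX Non-Commercial License", True, False),
    "FLUX.2 [pro]": LicenseTerms("closed, API-only", False, False),  # no weights to host
    "Qwen-Image": LicenseTerms("Apache 2.0", True, True),
    "SD 3.5": LicenseTerms("Stability Community License", True, True, SD35_REVENUE_CAP_USD),
}


def can_ship_self_hosted(model: str, annual_revenue_usd: float) -> bool:
    """True if the weights can back a paid product with no separate agreement."""
    terms = TERMS[model]
    if not terms.downloadable or not terms.commercial_use:
        return False
    if terms.revenue_cap_usd is not None and annual_revenue_usd >= terms.revenue_cap_usd:
        return False
    return True


for model in TERMS:
    print(f"{model:18s} ship at $200k revenue: {can_ship_self_hosted(model, 200_000)}")
```

The table makes the trap visible: two entries that share the FLUX.2 [klein] name return opposite answers.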

Figure 2. The image roster by license and hardware. Reading down is the trap: the same FLUX.2 brand appears in the freely-usable top band (klein 4B) and the restricted bottom band (dev, klein 9B). Reading across is the reassurance: every model fits a single consumer card. The license is the axis that decides whether you have a product or only a toy.

2.3 Where open stands

On quality, the practitioner consensus in mid-2026 is that open image generation has effectively reached the closed frontier for local and self-hosted work. FLUX.2 and Qwen-Image are treated as daily drivers — Qwen tends to win on realism and text rendering, FLUX on prompt adherence and fine control — and both are routinely judged competitive with the closed image services for most production needs. Closed tools may keep an edge in particular aesthetic styles or sheer polish, but they no longer hold a categorical lead, and they cannot match the open ecosystem's control surface: the fine-tunes, the LoRAs, the node-based workflows in ComfyUI, the ability to run a million images at zero marginal cost on your own card. For images, owning the tool costs you almost nothing and buys you control you cannot rent.

Practice

Open the actual license file for one model you'd want to use — not the model card blurb, the license. For a FLUX.2 variant, confirm whether it's the Apache 4B or a non-commercial one; for Qwen-Image or SD 3.5, find the clause about commercial use and revenue. Write one sentence: "I can / cannot ship the output of this model in a paid product, because the license says ___." That sentence is the entire skill of this chapter.

Chapter Three

Audio

Reading: The Hugging Face Open ASR Leaderboard and its companion paper; the Whisper large-v3 and Kokoro-82M model cards; NVIDIA's NeMo Canary and Parakeet cards.

3.1 Speech in: transcription

Turning speech into text is the most settled corner of this unit. OpenAI's Whisper (large-v3) remains the multilingual workhorse — 99 languages, permissively licensed, and runnable on a CPU through whisper.cpp or fast on a GPU through faster-whisper. For English specifically, NVIDIA's models have taken the lead: Canary-Qwen 2.5B tops the Open ASR Leaderboard at a 5.63% word error rate, and the Parakeet family is the throughput champion, transcribing far faster than real time. Of the eighty-some models on that leaderboard in early 2026, the large majority are open. The honest qualifier: closed services such as ElevenLabs Scribe still lead on the hardest long-form and multilingual transcription, so the open win is clearest for English and for short-form audio. For most local transcription needs, an open model is not a compromise — it is the better choice, with no per-minute bill and nothing leaving the machine.
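The 5.63% figure above is a word error rate: the number of word substitutions, deletions, and insertions separating the model's transcript from a reference transcript, divided by the number of reference words. Here is a minimal sketch of that calculation, useful for scoring a local model on your own audio. It only lowercases and splits on whitespace, which is a simplifying assumption; leaderboards apply more thorough text normalization before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Classic dynamic-programming edit distance, computed over words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1       # reference word missing from hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "the cat sat on the mat"
hypothesis = "the cat sat on mat"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

In the example, one dropped word out of six reference words gives a rate of about 16.7%.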

3.2 Speech out: synthesis

Text-to-speech is the modality that most surprises people, because the standout model is tiny. Kokoro-82M is an 82-million-parameter model, a few hundred megabytes, released under Apache 2.0. It runs comfortably on a CPU and reached the top of the TTS preference arena in January 2026, beating models a hundred times its size. A frontier-class voice that runs on any laptop with no GPU and no recurring cost is the single clearest equity story in this cluster. Above it sit more expressive options for when you need voice cloning or emotional range: Chatterbox (Resemble AI, MIT-licensed) won a clear majority preference in blind tests against ElevenLabs, the leading closed service. Closed TTS keeps an edge in the most natural prosody and zero-shot cloning polish, but for narration, prototyping, accessibility, and any cost-sensitive use, open is excellent and free.

Figure 3. The audio loop, entirely local. Speech in through an ASR model, optional reasoning through the Unit-one language model, speech out through a tiny TTS model — none of it leaving the machine, and the lightest hardware in the unit: a CPU is enough for transcription and for Kokoro.

Practice

Transcribe one audio file locally — a voice memo, a meeting recording — with a local Whisper runner, fully offline. Note two things: that it ran with no network connection, and roughly how fast it went on your hardware. Then ask what you currently pay, in money or in data exposure, to send that same audio to a cloud transcription service. That difference is the case for local audio in one experiment.

Chapter Four

The knowledge layer

Reading: The MTEB embedding leaderboard and the Qwen3-Embedding repository and cards; BAAI's BGE-M3 and Nomic's embedding cards; and the local vector-store documentation (Chroma, Qdrant, FAISS).

4.1 What embeddings are, and why they're the point

An embedding model turns a piece of text into a vector — a long list of numbers — positioned so that texts with similar meaning land near each other. That single trick is the engine under semantic search and under retrieval-augmented generation (RAG): the pattern where, to answer a question about your documents, a system finds the passages whose embeddings sit closest to the question's and hands those to a language model. Embeddings are the least glamorous modality in this unit and the most important to it, because a local embedding model plus a local vector store plus the Unit-one language model is a complete, private question-answering system over your own files — and nothing in that loop ever leaves your machine.
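"Closest" in the paragraph above is usually measured with cosine similarity between vectors. Here is a minimal sketch of nearest-passage lookup under that assumption. The three-dimensional vectors are hand-made stand-ins for what a real embedding model would produce (real embeddings have hundreds or thousands of dimensions), and the function names are illustrative.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list, store: list, k: int = 2) -> list:
    """Return the k (score, text) pairs whose vectors sit closest to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hand-made vectors: texts about the same topic point in similar directions.
store = [
    ("How to renew a passport", [0.9, 0.1, 0.0]),
    ("Passport renewal fees", [0.8, 0.2, 0.1]),
    ("Sourdough starter feeding schedule", [0.0, 0.1, 0.9]),
]
query = [0.85, 0.15, 0.05]  # stand-in for the embedding of a passport question

for score, text in top_k(query, store):
    print(f"{score:.3f}  {text}")
```

The two passport passages score far above the sourdough one, which is the whole mechanism: a vector store performs this comparison at scale, typically with an index that avoids scanning every vector.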

4.2 Open owns this modality

If there is one place open has flatly won, it is embeddings. Qwen3-Embedding-8B sits at the top of the MTEB multilingual leaderboard with a score of 70.58, ahead of the closed embedding APIs from OpenAI and Google on multilingual retrieval. It comes in 0.6B, 4B, and 8B sizes, and the smaller ones run near-instantly on modest hardware. BGE-M3 (BAAI, MIT-licensed) is the most-downloaded open embedder and pairs dense and sparse retrieval in one model; nomic-embed is a tiny, permissive workhorse. The one honest qualifier is narrow: on English-only retrieval, Google's closed Gemini embedding still holds a slim lead. Everywhere else, and especially across languages, the open models are at or beyond the frontier, and the economics are not close. At a steady hundred million tokens a day, a closed embedding API runs on the order of thirteen thousand dollars a month; the same work self-hosted is a few hundred dollars. (All figures as of June 2026; verify before quoting.)
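The cost comparison above is simple arithmetic, and it is worth redoing with your own numbers before quoting it. The sketch below assumes a 30-day month and a flat per-token price; the function names are illustrative, and no provider's actual price list is encoded here.

```python
def monthly_cost_usd(tokens_per_day: float, usd_per_million_tokens: float,
                     days: int = 30) -> float:
    """Monthly API bill at a flat per-token price."""
    return tokens_per_day * days / 1e6 * usd_per_million_tokens


def implied_price_per_million(monthly_usd: float, tokens_per_day: float,
                              days: int = 30) -> float:
    """Per-million-token price that a given monthly bill implies."""
    return monthly_usd / (tokens_per_day * days / 1e6)


tokens_per_day = 100e6  # the steady volume quoted in the text

price = implied_price_per_million(13_000, tokens_per_day)
print(f"Implied price: ${price:.2f} per million tokens")

# Plug in the rate from a current price list to get your own monthly figure.
example_rate = 0.50  # hypothetical rate, for illustration only
print(f"At ${example_rate:.2f}/M: ${monthly_cost_usd(tokens_per_day, example_rate):,.0f} per month")
```

At the quoted volume, thirteen thousand dollars a month corresponds to roughly $4.33 per million tokens. The bill scales linearly with the rate, so a provider charging a different price moves the monthly figure in proportion.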

Figure 4. The sovereign knowledge loop, and the quiet heart of the cluster. Your documents are embedded once into a local vector store; each question retrieves the closest passages and hands them to the Unit-one model for an answer. The entire ring is inside your machine — which is what data sovereignty means when it stops being a policy and becomes a wiring diagram.

4.3 Own your knowledge

This is where the cluster's thread ties off. A private RAG system over your own corpus is the same idea, scaled to a person or a team, as keeping your knowledge in portable files you control rather than inside a vendor's catalog. The model embeds; the vector store holds; the language model answers; and because every piece is open and local, the corpus — which may be the most sensitive thing you own — is never uploaded, never trained on, never subject to anyone's terms changing under you. The earlier units kept your conversations and your code on the machine. This one keeps your knowledge there too, which is the form of sovereignty that matters most and the reason the whole cluster has been building toward this chapter.
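The three roles named above (the model embeds, the store holds, the language model answers) can be wired together in a few lines. In this sketch every component is a deliberately crude stand-in: `toy_embed` is a bag-of-words counter rather than a real embedding model, `MemoryStore` is a plain list rather than Chroma, Qdrant, or FAISS, and the final step only builds the prompt that would be handed to the Unit-one model. All names are hypothetical.

```python
from collections import Counter
import math


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return Counter(w for w in words if w)


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MemoryStore:
    """Stand-in for a local vector store: embeds on insert, ranks on search."""

    def __init__(self):
        self.items = []  # list of (passage, vector) pairs

    def add(self, passage: str) -> None:
        self.items.append((passage, toy_embed(passage)))

    def search(self, question: str, k: int = 2) -> list:
        q = toy_embed(question)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [passage for passage, _ in ranked[:k]]


def build_prompt(question: str, passages: list) -> str:
    """Assemble what would be sent to the local language model."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only these notes:\n{context}\n\nQuestion: {question}"


store = MemoryStore()
store.add("Rent is due on the first of each month.")
store.add("Wifi password is taped under the router.")
store.add("Sourdough starter needs feeding every morning.")

question = "When is rent due?"
print(build_prompt(question, store.search(question, k=1)))
```

Swapping the stand-ins for a real embedding model and a real vector store changes the quality of retrieval, not the shape of the loop, and nothing in that shape requires a network call.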

Practice

Embed a folder of your own notes with a local model and run one similarity query against it — "what did I write about X" — entirely offline. Watch it return the right passages without a single network call. You have just built, in miniature, the private knowledge layer this chapter is about; scaling it up is engineering, but the sovereignty is already complete at this size.

Chapter Five

The honest gap

Reading: The open-video model cards and licenses (Alibaba's Wan, Tencent's HunyuanVideo, Lightricks' LTX-Video) and a public text-to-video preference arena; and the companion guide Who the Agent Rewards on verification and the jagged frontier.

5.1 Video: where open still trails, and by how much

Every other modality in this unit reached the frontier; video is where open still trails, though by less than its reputation suggests. The open flagship is Alibaba's Wan family, released under Apache 2.0: the family's small 1.3B variant (from the Wan 2.1 release) runs on an 8 GB card, and the April 2026 Wan 2.7 debuted at or near the top of public text-to-video preference leaderboards. So open video is genuinely good now. What closed still holds is the premium ceiling: the top services (Sora 2, Veo 3.1, Kling) lead on native 4K, long and continuous shots, physical realism in motion, and precise lip-sync. The honest 2026 practice is tiered routing: generate the bulk of clips with an open model locally at zero marginal cost, and reach for a closed service only for the hero shot that needs the ceiling. Video is also the heaviest modality here, since the larger Wan variants want multiple GPUs, so it strains the rig in a way image and audio do not.
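Tiered routing, as described above, amounts to a small rule applied per clip. Here is a sketch under stated assumptions: the `ClipRequest` fields mirror the four capabilities the text attributes to the closed services, the ten-second cutoff for a "long" shot is a hypothetical placeholder, and the backend labels are illustrative rather than real service names.

```python
from dataclasses import dataclass

LONG_SHOT_SECONDS = 10.0  # hypothetical cutoff; tune it to the open model you run


@dataclass
class ClipRequest:
    hero_shot: bool = False
    needs_4k: bool = False
    needs_lip_sync: bool = False
    needs_physical_realism: bool = False
    duration_s: float = 5.0


def route(clip: ClipRequest) -> tuple:
    """Return (backend, reasons). Closed only for hero shots that need the ceiling."""
    reasons = []
    if clip.needs_4k:
        reasons.append("native 4K")
    if clip.needs_lip_sync:
        reasons.append("precise lip-sync")
    if clip.needs_physical_realism:
        reasons.append("physical realism in motion")
    if clip.duration_s > LONG_SHOT_SECONDS:
        reasons.append("long continuous shot")

    if clip.hero_shot and reasons:
        return "closed", reasons
    return "open-local", reasons


batch = [
    ClipRequest(),                                  # routine b-roll
    ClipRequest(needs_4k=True),                     # nice to have, not a hero shot
    ClipRequest(hero_shot=True, needs_4k=True, duration_s=20.0),
]
for clip in batch:
    backend, reasons = route(clip)
    print(backend, reasons)
```

The design choice worth noting is that a ceiling requirement alone does not trigger the closed service; the clip also has to be one that matters enough to pay for.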

Figure 5. The reshaped picture. Open has converged on the closed frontier across every modality, and the bars nearly reach the line. What closed still holds is a premium ceiling (the accent remainder): effectively gone in image, audio, and multilingual embeddings; slim in English-only embeddings; and widest in video, where 4K, length, and physics still favor the closed services. The story is not "open behind": open caught up, and the remaining edge is narrowing.

5.2 The media-license skew, and softer benchmarks

Two cautions carry across all of this unit's modalities. The first is the license skew already met in Chapter 2: media weights trend non-commercial far more than language models do — FLUX.2 [dev], many video models, some music models — so "open" in media demands the license read even more than in text, where Apache and MIT are the norm. The second is that the benchmarks are softer. Image, audio, and video quality are partly subjective, judged by preference arenas and listening tests rather than the kind of test suite that grades code — and Unit two showed that even code's hard benchmarks were inflated. The lesson transfers: trust your own eyes and ears over any leaderboard, and treat a single headline score as a starting hypothesis, not a verdict.

5.3 The reconciliation

The companion guide Who the Agent Rewards drew the line that applies here too: a model is the input, and the judgment about whether its output is actually good is supplied by you. That is why the verification habit matters more as these tools get better, and why the jagged frontier — some tasks easily within a model's reach, others of similar appearance just beyond it — is as real for a video prompt or a tricky transcription as for code. Owning the stack does not remove that judgment; it concentrates it in your hands, which is exactly where the cluster has argued it belongs.

Check your understanding

Pick the one modality in this unit where you'd still reach for a closed service, and say precisely why — the specific capability at the premium ceiling that open doesn't yet match for your work. If the honest answer is "none of them, for what I actually do," that is the finding: for most real work, the whole stack is already local.

Chapter Six

Own the stack

Reading: Synthesis of the cluster; the runner docs (Ollama, ComfyUI) and the local vector-store and TTS/ASR tooling above; and Units one and two, Run Your Own Model and Open Models That Code.

6.1 The whole rig, assembled

Stand the three units together and you have a complete, sovereign, multimodal workstation on hardware you own. The Unit-one language model answers and reasons; the Unit-two coding agent writes and fixes code; this unit adds image generation through ComfyUI, speech in and out through Whisper and Kokoro, and a private knowledge layer through an embedding model and a vector store. One rig, one set of skills — an open model behind a runner — repeated across every modality. The connective tissue is the same throughout: open weights you can keep, runners you control, and a machine boundary that nothing crosses.

Figure 6. The assembled stack. Five workloads on one rig, every one an open model behind a runner, the whole of it inside your machine. The only thing outside the boundary is the narrow premium ceiling — chiefly cinematic video — that you reach for deliberately, not by default. That is the cluster, complete.

6.2 The decision, per modality

When is local the right call? For this unit's modalities the answer is unusually easy. Image, audio, and embeddings: run local by default — open is at the frontier, the hardware is light, the licenses (read them) are often permissive, and the privacy and cost wins are large. Video: run open for the volume and reach for a closed service only when you need the premium ceiling. And as always, sovereignty decides first: if the data — the source images, the recorded voices, the corpus — cannot leave, the decision is made before cost or quality enters. The cluster's rule holds to the end: sovereignty is a dial, not a switch, and in these modalities it sits nearly all the way toward local.
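The decision above has a fixed order: sovereignty first, then modality, then the premium ceiling. Here is a minimal sketch of that ordering as a function. The names and the string labels are illustrative assumptions, and the rule for video is simplified to a single yes-or-no flag.

```python
LOCAL_BY_DEFAULT = {"image", "audio", "embeddings"}


def choose_backend(modality: str, data_must_stay: bool,
                   needs_premium_ceiling: bool = False) -> str:
    """Apply the section's rule in order: sovereignty, modality, ceiling."""
    # Sovereignty decides first: if the data cannot leave, nothing else matters.
    if data_must_stay:
        return "local"
    # Image, audio, and embeddings: open is at the frontier, so default to local.
    if modality in LOCAL_BY_DEFAULT:
        return "local"
    # Video: open for the volume, closed only for the premium ceiling.
    if modality == "video":
        return "closed" if needs_premium_ceiling else "local"
    raise ValueError(f"unknown modality: {modality}")


print(choose_backend("audio", data_must_stay=False))
print(choose_backend("video", data_must_stay=False, needs_premium_ceiling=True))
print(choose_backend("video", data_must_stay=True, needs_premium_ceiling=True))
```

The third call is the instructive one: even a clip that wants the premium ceiling stays local when the source material cannot leave the machine, because the sovereignty check runs before quality is considered.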

6.3 The close

Three units, one argument. Own the model: stand up an open language model and judge when local beats the API. Own the tools: drive it with an open coding agent, your source never leaving the building. Own the stack: complete the rig with images, audio, and a private knowledge layer, every modality at or near the frontier, on hardware that is yours. The thread was never really about avoiding a bill — it was about keeping authority over your own work where it belongs: your conversations, your code, your knowledge, your tools. The open ecosystem made that possible across the whole stack, faster than almost anyone expected. What you build on it is the part that was always going to be yours.

Practice · put it together

Stand up two new modalities from this unit on your rig — say, local transcription and a small RAG index over your own notes. Then write the cluster's closing audit for your own situation: for each of the five workloads (chat, code, image, audio, knowledge), local or closed, and the one-sentence reason. If most of your answers are "local," you didn't just read the cluster — you own the stack.

Beyond Text · June 2026
