You keep hearing that AI models are biased. Instead of debating it, I built something you can show to a room: a tool that sends the same prompt to six leading models at once and puts the responses side by side, with per-model token counts, estimated cost, and a rough carbon estimate underneath each response.
It is built for classroom use at the University of Montana. The point is to make "models differ" a thing students can see, not take on faith.
How it works
You type a prompt. The server fans it out in parallel to Claude, GPT, Grok, Gemini, and two local open-weights models running on a campus GPU. Responses stream back into six columns. Every column shows token counts, estimated dollar cost, and elapsed time.
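The fan-out step can be sketched roughly like this. This is a minimal illustration, not the tool's actual code: the per-provider call functions and model names here are hypothetical stand-ins for the real SDK wrappers, and streaming is elided in favor of a single awaited response per model.

```typescript
type ModelResult = {
  model: string;
  text: string;
  elapsedMs: number;
  error?: string;
};

// Hypothetical provider wrapper: takes a prompt, returns the response text.
type ModelCall = (prompt: string) => Promise<string>;

async function fanOut(
  prompt: string,
  models: Record<string, ModelCall>
): Promise<ModelResult[]> {
  // Fire every request at once; a slow or failing model must not block the rest.
  const entries = Object.entries(models);
  const settled = await Promise.allSettled(
    entries.map(async ([model, call]) => {
      const start = Date.now();
      const text = await call(prompt);
      return { model, text, elapsedMs: Date.now() - start };
    })
  );
  // Keep one column per model even on failure, so the UI can show the error.
  return settled.map((result, i) =>
    result.status === "fulfilled"
      ? result.value
      : { model: entries[i][0], text: "", elapsedMs: 0, error: String(result.reason) }
  );
}
```

`Promise.allSettled` (rather than `Promise.all`) is the key choice: one provider hitting a rate limit should produce an error in its column, not an empty page.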
[Screenshots:]
- text mode — six models waiting on a prompt
- college vs. workforce — the responses side by side
- blind mode — vote without seeing the name
- image mode — 'IT person' across five generators
- code mode with execution
- history across text, image, and code
A test case in the wild
A loaded question, asked six different ways.
"Should an 18-year-old in 2026 default to college or default to entering the workforce? Choose one."
When I let the models explain their reasoning, most leaned workforce-first. They cited student debt risk, AI compressing entry-level roles, uncertainty around the traditional degree premium. One chose college, leaning on lifetime-earnings data and long-term stability.
Then I changed one variable: "Answer in one word only."
The distribution flipped.
- Most responded: College.
- One said: Workforce.
- One refused to choose.
Same models. Same question. One constraint added. When the models had room to reason, several worked through tradeoffs and uncertainty. When forced into a binary, many defaulted to the historically dominant social norm. The underlying data did not change. The framing did.
Run the prompt multiple times and the split holds in aggregate, but individual models drift — some stay put every run, some flip, one ignores the one-word rule and writes an essay anyway.
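Measuring that drift is just bookkeeping: run the prompt N times and count each model's answers. A minimal tally sketch, assuming each run yields one answer string per model (the run shape here is an assumption, not the tool's data model):

```typescript
// One run: model name -> the one-word answer it gave.
type Run = Record<string, string>;

// Per-model answer counts across all runs.
function tally(runs: Run[]): Record<string, Record<string, number>> {
  const counts: Record<string, Record<string, number>> = {};
  for (const run of runs) {
    for (const [model, answer] of Object.entries(run)) {
      counts[model] ??= {};
      counts[model][answer] = (counts[model][answer] ?? 0) + 1;
    }
  }
  return counts;
}
```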
That is the point of the tool. If a small constraint can flip a recommendation, prompt design is not cosmetic — it is consequential. For anyone integrating AI into real decisions, that should matter.
Models and modes
- Claude Sonnet 4.6 (Anthropic)
- GPT-5.2 (OpenAI)
- Grok 4 (xAI)
- Gemini 3 Flash (Google)
- Llama 4 (Meta, local)
- Phi-4 14B (Microsoft, local)
- Text mode
- Image mode (GPT Image 1 · Imagen 4 · Grok Imagine · SDXL · FLUX.1 Schnell)
- Code mode with execution
Classroom features
- Blind mode — hide model names, vote which is best
- ELO leaderboard — aggregate class preferences over time
- Diff view — highlight where two responses diverge
- Temperature slider
- Prompt history
- Cost tracking per user, per day
- Export to PDF / Markdown
- Dark mode
- Role-based limits (student / teacher / admin)
- Moderation on student accounts
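The leaderboard uses the standard Elo update: after each blind vote, the winner gains and the loser loses rating, proportional to how surprising the result was. A sketch with the usual constants (K = 32, 400-point scale; the choice of K here is an assumption, not necessarily what the tool ships with):

```typescript
// Probability that a player rated `ratingA` beats one rated `ratingB`.
function expectedScore(ratingA: number, ratingB: number): number {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

// Returns [newWinnerRating, newLoserRating] after one vote.
function eloUpdate(winner: number, loser: number, k = 32): [number, number] {
  const expWin = expectedScore(winner, loser);
  const delta = k * (1 - expWin); // upsets move ratings more than expected wins
  return [winner + delta, loser - delta];
}
```

With two models at 1000, one win moves the pair to 1016 / 984; beating a much higher-rated model moves it further. Aggregated over a semester of class votes, this gives a stable preference ordering.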
Stack
- Next.js 16
- React 19
- Tailwind 4
- TypeScript
- PostgreSQL
- Docker
- Local GPU pool (SDXL / FLUX / Llama / Phi)