You keep hearing that AI models are biased. Instead of debating it, I built something you can show to a room: a tool that sends the same prompt to six leading models at once and puts the responses side by side, with per-model token counts, estimated cost, and a rough carbon estimate underneath each response.
It is built for classroom use at the University of Montana. The point is to make "models differ" a thing students can see, not take on faith.
How it works
You type a prompt. The server fans it out in parallel to Claude, GPT, Grok, Gemini, and two local open-weights models running on a campus GPU. Responses stream back into six columns. Every column shows token counts, estimated dollar cost, and elapsed time.
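The fan-out step can be sketched roughly like this. This is a minimal illustration, not the tool's actual code: the per-provider call functions and model names here are hypothetical stand-ins for the real SDK wrappers, and streaming is elided in favor of a single awaited response per model.

```typescript
type ModelResult = {
  model: string;
  text: string;
  elapsedMs: number;
  error?: string;
};

// Hypothetical provider wrapper: takes a prompt, returns the response text.
type ModelCall = (prompt: string) => Promise<string>;

async function fanOut(
  prompt: string,
  models: Record<string, ModelCall>
): Promise<ModelResult[]> {
  // Fire every request at once; a slow or failing model must not block the rest.
  const entries = Object.entries(models);
  const settled = await Promise.allSettled(
    entries.map(async ([model, call]) => {
      const start = Date.now();
      const text = await call(prompt);
      return { model, text, elapsedMs: Date.now() - start };
    })
  );
  // Keep one column per model even on failure, so the UI can show the error.
  return settled.map((result, i) =>
    result.status === "fulfilled"
      ? result.value
      : { model: entries[i][0], text: "", elapsedMs: 0, error: String(result.reason) }
  );
}
```

`Promise.allSettled` (rather than `Promise.all`) is the key choice: one provider hitting a rate limit should produce an error in its column, not an empty page.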
[Screenshots:]
- text mode — six models waiting on a prompt
- college vs. workforce — the responses side by side
- blind mode — vote without seeing the name
- image mode — 'IT person' across five generators
- code mode with execution
- history across text, image, and code
A test case in the wild
A loaded question, asked six different ways.
"Should an 18-year-old in 2026 default to college or default to entering the workforce? Choose one."
When I let the models explain their reasoning, most leaned workforce-first. They cited student debt risk, AI compressing entry-level roles, uncertainty around the traditional degree premium. One chose college, leaning on lifetime-earnings data and long-term stability.
Then I changed one variable: "Answer in one word only."
The distribution flipped.
- Most responded: College.
- One said: Workforce.
- One refused to choose.
Same models. Same question. One constraint added. When the models had room to reason, several worked through tradeoffs and uncertainty. When forced into a binary, many defaulted to the historically dominant social norm. The underlying data did not change. The framing did.
Run the prompt multiple times and the split holds in aggregate, but individual models drift — some stay put every run, some flip, one ignores the one-word rule and writes an essay anyway.
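Measuring that drift is just bookkeeping: run the prompt N times and count each model's answers. A minimal tally sketch, assuming each run yields one answer string per model (the run shape here is an assumption, not the tool's data model):

```typescript
// One run: model name -> the one-word answer it gave.
type Run = Record<string, string>;

// Per-model answer counts across all runs.
function tally(runs: Run[]): Record<string, Record<string, number>> {
  const counts: Record<string, Record<string, number>> = {};
  for (const run of runs) {
    for (const [model, answer] of Object.entries(run)) {
      counts[model] ??= {};
      counts[model][answer] = (counts[model][answer] ?? 0) + 1;
    }
  }
  return counts;
}
```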
That is the point of the tool. If a small constraint can flip a recommendation, prompt design is not cosmetic — it is consequential. For anyone integrating AI into real decisions, that should matter.
Models and modes
- Claude Sonnet 4.6 (Anthropic)
- GPT-5.2 (OpenAI)
- Grok 4 (xAI)
- Gemini 3 Flash (Google)
- Llama 4 (Meta, local)
- Phi-4 14B (Microsoft, local)
- Text mode
- Image mode (GPT Image 1 · Imagen 4 · Grok Imagine · SDXL · FLUX.1 Schnell)
- Code mode with execution
Classroom features
- Blind mode — hide model names, vote which is best
- ELO leaderboard — aggregate class preferences over time
- Diff view — highlight where two responses diverge
- Temperature slider
- Prompt history
- Cost tracking per user, per day
- Export to PDF / Markdown
- Dark mode
- Role-based limits (student / teacher / admin)
- Moderation on student accounts
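The leaderboard uses the standard Elo update: after each blind vote, the winner gains and the loser loses rating, proportional to how surprising the result was. A sketch with the usual constants (K = 32, 400-point scale; the choice of K here is an assumption, not necessarily what the tool ships with):

```typescript
// Probability that a player rated `ratingA` beats one rated `ratingB`.
function expectedScore(ratingA: number, ratingB: number): number {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

// Returns [newWinnerRating, newLoserRating] after one vote.
function eloUpdate(winner: number, loser: number, k = 32): [number, number] {
  const expWin = expectedScore(winner, loser);
  const delta = k * (1 - expWin); // upsets move ratings more than expected wins
  return [winner + delta, loser - delta];
}
```

With two models at 1000, one win moves the pair to 1016 / 984; beating a much higher-rated model moves it further. Aggregated over a semester of class votes, this gives a stable preference ordering.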
Stack
- Next.js 16
- React 19
- Tailwind 4
- TypeScript
- PostgreSQL
- Docker
- Local GPU pool (SDXL / FLUX / Llama / Phi)