Defense in Depth, Part 2: Five Things I Got Wrong About LLM Reviewers
Why same-family review can become a closed loop — and how one cheap Gemini-Flash pass caught what three same-family reviewers approved.
The OpenAI API call is 10 lines of code.
In Part 1, I argued the other 95% is guardrails, evals, edge cases, and modularity. One of those ten decisions — Decision 5 — said:
> *"A naive 'second opinion' LLM makes the same mistakes as the first — same training, same context."*
That was directionally right. I also badly underestimated how far it goes.
Over the last two weeks, while working on patent architecture for a separate AI project, I ran an internal trajectory review on roughly three hours of dense design work. I used three reviewer personas — all running on the same frontier model. They approved the direction. I was about to ship.
Then I ran the same review on **a different model family** — the cheaper one. A Gemini-2.5-Flash pass caught a category of drift that all three same-family reviewers had rationalized. When I escalated to GPT-5 as a tiebreaker, it independently converged on the same class of issues Flash had named.
(Worth saying out loud: I changed more than just the model family between those runs — different session, different prompt framing, different context packaging. So I'm treating this as a strong hint, not a controlled experiment. Part 3 is the controlled version.)
Both external reviewers returned a **NO-ship verdict**, with overlapping dealbreakers. The same-family reviewers would have let me file.
That review cost me 45 minutes. Filing the patent would have cost me a lot more.
Here are the five things I got wrong about LLM reviewers — and what I now believe is the right architecture.
---
### Wrong #1: "Different information" is the differentiator
**What I thought:** Decision 5 in Part 1 solved this. Give the QA agent different inputs (patient record + primary determination + deterministic engine result). Problem fixed.
**What's actually true:** Different information is *necessary, not sufficient*. Same-family models likely share more correlated failure modes than cross-family models do — overlapping pretraining corpora, related post-training recipes, and shared RLHF norms all push them toward correlated blind spots. Different inputs don't cancel that.
**The sharper Decision 5:** Not "different information." **Different model family** (as a practical proxy for different training distribution). Your QA reviewer should come from a different lineage than your primary — Claude reviewing Gemini, GPT-5 auditing Claude, Mistral checking OpenAI. Family diversity is a useful hedge against correlated bias — an engineering heuristic, not a proven theorem. (*Caveat earned the hard way: "family" is not identical to "training distribution." Different vendors still share internet-scale pretraining corpora and RLHF norms, and same-family variants can diverge materially. "Family" is the practical proxy, not a guarantee.*)
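The rule is simple enough to encode. A minimal sketch, assuming a hand-maintained family map (the model names and the `FAMILY` mapping here are illustrative, not an endorsement of specific versions):

```python
# Hypothetical family map -- adjust to the models you actually run.
FAMILY = {
    "claude-opus": "anthropic",
    "gpt-5": "openai",
    "gemini-2.5-flash": "google",
    "mistral-large": "mistral",
}

def pick_reviewer(author_model: str, candidates: list[str]) -> str:
    """Return the first candidate from a different lineage than the author.

    Raises if every candidate shares the author's family -- a same-family
    "second opinion" is the closed loop this post argues against.
    """
    author_family = FAMILY[author_model]
    for model in candidates:
        if FAMILY[model] != author_family:
            return model
    raise ValueError("all candidate reviewers share the author's family")
```

So `pick_reviewer("claude-opus", ["gemini-2.5-flash", "gpt-5"])` returns `"gemini-2.5-flash"`, while a candidate list of three Claude variants raises instead of quietly approving itself.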
Research anchors:
- Verga et al., *"Replacing Judges with Juries"* (2024) — a panel of smaller, diverse judges outperforms a single GPT-4 reviewer while being ~7x cheaper. Supportive evidence for panel-level diversity (PoLL does not isolate family diversity as the sole causal variable).
- Self-Preference Bias in LLM-as-a-Judge (OpenReview, 2024-25) — LLM judges exhibit self-preference bias, favoring outputs familiar to them; related work suggests family similarity can worsen the effect.
- Preference Leakage (ICLR 2026) — primarily about contamination between synthetic-data generators and judges, plus bias toward related student models. Adjacent evidence for why same-lineage judges are structurally suspect, not a direct proof of the generic same-family-reviewer claim.
---
### Wrong #2: More reviewers = linearly better
**What I thought:** Reviewers scale linearly: one gives some benefit, two gives twice the benefit, and so on.
**What's actually true:** The return curve is sharply front-loaded.
- **Zero → one external reviewer** is often the biggest marginal gain in practice. You go from "self-review — a closed loop" to "external signal exists." That's a structural change, not just a quantitative one. (External reviewers can still share context, rubric, or failure modes — "external" is not magic, it's just the first real gap that lets new information in.) Self-review inside the same model and same session is a well-known weak baseline; I'll publish the F1 numbers and full methodology in Part 3.
- **One → two** is a real but ordinary improvement. Useful. Not categorical.
- **Two → three** is diminishing returns unless you also add a new *axis* of diversity (new family, new modality, new role).
**The implication:** If you're cost-constrained, spend your budget on making the first external reviewer exist. Don't stack three same-family reviewers and call it robust — that's one reviewer and a hall of mirrors.
---
### Wrong #3: The reviewer has to be as smart as the author
**What I thought:** Frontier author → frontier reviewer. Anything less is a downgrade.
**What's actually true:** **Cheap-diverse can beat expensive-same-family — and in my case this week, it did.**
Concretely: the author model was a Claude-Opus-class frontier model. The reviewer that caught the drift was Gemini-2.5-Flash. At published API prices that's roughly a 25-50x cost gap per token for my mix — not a 100x gap; I had that number wrong in my head. (Gemini 2.5 Flash is $0.30/$2.50 per 1M in/out tokens; Claude Opus 4.1 is $15/$75; GPT-5 sits between. Full pricing: [OpenAI](https://openai.com/api/pricing/), [Google](https://ai.google.dev/gemini-api/docs/pricing), [Anthropic](https://docs.anthropic.com/en/docs/about-claude/pricing).) GPT-5 later *confirmed* the finding independently. The expensive same-family reviewer approved the direction.
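The back-of-envelope math, using the per-1M-token prices quoted above (the 70/30 input/output mix is my assumption; plug in your own):

```python
# Published prices (USD per 1M tokens), as quoted in the text.
FLASH_IN, FLASH_OUT = 0.30, 2.50   # Gemini 2.5 Flash
OPUS_IN, OPUS_OUT = 15.00, 75.00   # Claude Opus 4.1

def blended(price_in: float, price_out: float, input_share: float = 0.7) -> float:
    """Blended per-1M-token price for a given input/output token mix."""
    return price_in * input_share + price_out * (1 - input_share)

ratio = blended(OPUS_IN, OPUS_OUT) / blended(FLASH_IN, FLASH_OUT)
# 70/30 mix: (15*0.7 + 75*0.3) / (0.30*0.7 + 2.50*0.3) = 33.0 / 0.96, about 34x
```

Heavier output shares push the ratio toward 30x; input-heavy review workloads (long artifact in, short verdict out) push it toward 50x. Either way, the cheap reviewer is close to free next to the author.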
Why this works when it works: the small model's lineage doesn't share the author's correlated bias, so it carries information the author can't generate internally. Sharpness isn't always the bottleneck; *independence* can be.
Caveat: one anecdote plus PoLL isn't enough to claim this is generally true. Treat it as a live hypothesis worth testing in your own eval harness, not a universal law.
This is the PoLL intuition: a jury of weaker, diverse judges can beat a single stronger judge — even when each juror is individually weaker.
---
### Wrong #4: Parallel juries are the default shape
**What I thought:** Spawn N diverse reviewers in parallel, aggregate, done.
**What's actually true:** Parallel juries are expensive. Cascade-then-jury is, in my experience, a better default — and there's a supporting body of cost-aware-cascade literature to draw from, even though none of it proves the specific cross-family-reviewer architecture I'm arguing for.
Adjacent research worth reading:
- **FrugalGPT** (Chen/Zaharia/Zou, 2023): sequence models cheap→expensive, escalate only on low confidence. 30-98% cost savings on the benchmarks they tested.
- **Cascade Routing** (Dekoninck et al., ICML 2025): combines routing + cascading — competitive with or better than either alone on their evaluations.
- **CascadeDebate** (2026): inserts a small-model ensemble at each escalation boundary; matches larger-model Pareto frontiers at a fraction of the tokens (note: their ensembles are same-base-model, not cross-family).
None of these papers is about cross-family reviewer architecture specifically. I'm extrapolating — cascades work on cost, family diversity works on independence, combining them is an engineering bet, not a proven theorem.
**The architecture I'm running:** Run the cheap cross-family reviewer first (async, background, near-free). If it flags uncertainty or disagrees with the author, *then* escalate to an expensive cross-family reviewer. If both flag the same issue — stop and fix it before any further work.
Today's incident was accidentally this exact shape: Flash ran first (one pass, cheap), GPT-5 confirmed (one pass, more expensive). Two passes caught what three same-family passes had let through.
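That control flow can be sketched in a few lines. The `Verdict` shape and the reviewer callables are hypothetical stand-ins for whatever client code you actually use; only the escalation logic is the point:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    ship: bool          # reviewer's overall call
    uncertain: bool     # reviewer flagged low confidence
    issues: frozenset   # issue categories the reviewer named

def cascade_review(artifact,
                   cheap: Callable[..., Verdict],
                   expensive: Callable[..., Verdict]) -> str:
    """Cheap cross-family pass first; escalate only on doubt or disagreement."""
    first = cheap(artifact)
    if first.ship and not first.uncertain:
        return "ship"                 # clean cheap pass: done, near-free
    second = expensive(artifact)      # escalate to the expensive reviewer
    if first.issues & second.issues:
        return "stop-and-fix"         # both flag the same issue class
    return "ship" if second.ship else "escalate-to-human"
```

In the incident above, Flash played `cheap`, GPT-5 played `expensive`, and the overlapping dealbreakers were the `stop-and-fix` branch.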
---
### Wrong #5: This is an optimization
**What I thought:** Cross-family review is a nice-to-have. Good when you have time.
**What's actually true:** In my own practice, I've promoted it to an operating rule, not an optimization. Shipping a significant AI-generated artifact without one increasingly feels the way shipping untested code feels: possible, occasionally fine, and not a risk I want to take on load-bearing work.
The rule I now run:
> *No significant artifact — patent filing, architecture decision, production deploy, published claim — leaves my desk without independent review from a different model family. Same persona on the same underlying model is weaker than it looks: same-family models tend to share correlated failure modes.*
I'm calling this a rule, not a theorem. The asymmetry is what keeps me honest: when the review catches a real issue, the cost of being wrong is often weeks of downstream work; the cost of running the check is minutes.
---
### The meta-lesson: self-review is a closed loop
Every frontier model — Claude, GPT-5, Gemini, Mistral — is trained to be helpful, coherent, and internally consistent. That same training can make it a weaker auditor of *itself*. Not because any individual model is weak, but because the objective that makes it strong as an author is often the wrong objective for self-critique.
Andrew Ng has been arguing publicly on DeepLearning.AI that disciplined evals and error analysis are among the biggest predictors of how rapidly a team makes progress building an AI agent ([his recent post on this](https://www.linkedin.com/posts/andrewyng_deepseek-cuts-inference-costs-openai-tightens-activity-7384633283554447360-sVXe) — paraphrased here; the specific wording is his, the framing is mine). I think he's right. The corollary I'd add: the thing separating good eval practice from best-in-class is **who's allowed to run the eval.**
If the answer is "only models from the same family as the author," the eval is much weaker than it looks.
**Part 3 (coming): Cross-Context Review — testing whether even the same model in a fresh session outperforms self-review in the same session.** I'll publish the eval-harness methodology and F1 results next week.
→ If you're building eval infrastructure for agentic systems and this resonates, I'd love to compare notes.
#AIReliability #LLMEvaluation #LLMasJudge #AIEngineering #EvalDrivenDevelopment #ModelDiversity #AIAgents #MCP #FrontierAI #AppliedAI #ResponsibleAI #BuildInPublic


