Defense in Depth, Part 3: The Variable “Jury Beats Judge” Didn’t Control For
If cross-family review catches what same-family review misses, is the real variable the model family, or just the context? Here’s the harness I built to find out, and the hypothesis I’m publicly testing.
In Part 2, I argued that same-family LLM reviewers can become a closed loop, and that one cheap Gemini-2.5-Flash pass caught a category of drift three same-family reviewers had rationalized.
I also promised: “Part 3, coming: Cross-Context Review — testing whether even the same model in a fresh session outperforms self-review in the same session. I’ll publish the eval-harness methodology and F1 results next week.”
I’m going to partially break that promise. Here’s the methodology. The F1 numbers are coming in ~2 weeks. I’m publishing the harness first, on purpose, because most production systems don’t have any cross-context review at all — and the methodology is the load-bearing part.
This is the article I’d want to read before I saw someone else’s F1 table.
The sharper question Part 2 raises
Part 2 made a family-level claim: each model family carries its own correlated bias, and those biases differ across families; cross-family review catches what same-family review cannot.
But there’s a confound. In the Part 2 incident I changed more than just the model family between runs: different session, different prompt framing, different context packaging. So when Flash caught drift that Claude missed, at least three variables changed at once:
1. Family (Claude → Gemini)
2. Session (stale context → fresh context)
3. Framing (original prompt → review prompt)
Any of the three, individually, could have been the causal variable. Most likely it was some combination. Part 2 was a strong hint, not a controlled experiment.
The sharper question: if I hold family constant and vary only session/context, does self-review improve?
If yes: cross-context review is a cheap proxy — most teams could run it for free, same API key, same model.
If no: family is the load-bearing variable, and you actually need a second vendor relationship.
Either answer is useful. I don't know which one is true yet. Hence the harness.
The hypotheses, before the numbers
I’m publishing my prior because I want to be held to it.
H₀ (null): Same-session self-review and fresh-session self-review are statistically indistinguishable on catching seeded flaws.
H₁ (weak effect): Fresh-session self-review catches strictly more than same-session, but strictly less than cross-family review.
H₂ (strong effect): Cross-family review dominates both; fresh-session is a cheap-but-weak proxy.
My prior: H₁. Somewhere between 25% and 45% of flaws are session-dependent (the model latches onto a framing and self-consistency protects it). A fresh session breaks that loop but can’t escape the family’s shared bias on the remaining ~55–75%.
I am publicly committed to publishing the F1 table in ~2 weeks, whether it confirms H₁ or refutes it. If H₀ survives, I’ll say so. If H₂ turns out right, I’ll update my prior in writing. Subscribe if you want the receipt.
The harness (publishable; your corpus, your mileage)
Here’s the design. It fits on one page. If you build eval infrastructure, this is the part you can fork.
1. Corpus
Fifty seeded-flaw artifacts. Each is a design document, architectural sketch, or technical memo (300–800 words). Each contains 1–3 known flaws drawn from a taxonomy of five failure modes I’ve seen repeatedly in the last six months of my own work:
Ontological overclaim — “X is like Y” silently becoming “X is Y.”
Codification-as-closure — writing down the lesson instead of remediating the underlying bug.
Velocity-as-signal — mistaking rapid elaboration for clarity.
Symmetry-generated frame — a three-part structure so clean it was almost certainly imposed rather than inferred.
Analogy-as-argument — a comparison doing the work that reasoning should.
Each artifact’s flaws are injected by hand and graded by humans. Ground truth is the flaw list committed before any reviewer model sees the artifact.
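To make “committed before any reviewer sees it” concrete, here is roughly what one locked ground-truth entry could look like. The schema and field names are illustrative (mine, not the repo’s); the actual locked file ships with the preregistration.

```python
# One locked ground-truth record per seeded-flaw artifact (illustrative schema).
GROUND_TRUTH_EXAMPLE = {
    "artifact_id": "flawed-014",                     # hypothetical ID
    "flaws": [
        {
            "category": "ontological-overclaim",     # one of the five taxonomy categories
            "location": "para 3, sentence 2",        # where the flaw was injected
            "mechanism": "'behaves like a queue' silently becomes 'is a queue'",
        },
    ],
}
```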
2. Conditions
Authoring pipeline. Each artifact is a Claude-authored first draft (Claude is the author model under test), into which I hand-inject 1–3 flaws drawn from the taxonomy above. Ground truth — the exact flaw list, category, and location — is committed to a locked file before any reviewer model sees the artifact. The artifact text itself carries no authoring trace; under every condition, reviewers see only the final text (C1’s privileged context comes from the session history, not from the document). A zero-flaw control set (10 artifacts) is authored the same way but contains no injected flaws, to measure abstention discipline and overcalling.
C1 — Same-session self-review. Claude reviews the artifact in the same session that authored it. Full authoring context carried forward. (Tests: does privileged authoring context help or hurt self-review?)
C2 — Fresh-session self-review. Claude, same model snapshot, same review prompt, brand-new session with only the artifact visible — no authoring history. (Tests: does breaking the session-level context loop improve self-review?)
C3a — Cross-family review, Gemini reviewer. Gemini-2.5-Pro reviews the artifact in a fresh session with no authoring history, same review prompt as C2.
C3b — Cross-family review, GPT reviewer. GPT-5-class reviewer in a fresh session with no authoring history, same review prompt as C2.
C3-ensemble (reported separately): the union of C3a + C3b findings, to measure whether pooling across families gives additional signal beyond either family alone.
C3 is split into C3a / C3b deliberately. If one family wins cleanly and the other doesn’t, reporting them pooled would average away the causal signal this harness is designed to isolate.
All conditions use the same review prompt (published verbatim with the results). Review-order, reviewer identity, and model snapshot are logged per run. Decoding parameters are fixed and published: temperature=0, top_p=1, max_tokens=2000, seed=42 where supported. Exact model IDs (including snapshot date) are pinned in the preregistration and ship with the results.
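For anyone replicating the logging discipline, here is a minimal sketch of what gets pinned per run. The `ReviewRun` / `log_run` helpers and field names are hypothetical (not the repo’s API); the decoding values are the ones fixed above.

```python
from dataclasses import dataclass, asdict
import json

# Decoding parameters fixed for every condition and published with the results.
DECODING = {"temperature": 0, "top_p": 1, "max_tokens": 2000, "seed": 42}

@dataclass
class ReviewRun:
    artifact_id: str        # e.g. "flawed-014"
    condition: str          # "C1" | "C2" | "C3a" | "C3b"
    reviewer_model: str     # pinned snapshot ID from the preregistration
    review_order: int       # position in the run schedule
    prompt_sha256: str      # hash of the shared review prompt

def log_run(run: ReviewRun, findings: list[dict], path: str = "runs.jsonl") -> None:
    """Append one review run, with its decoding params, to a JSONL audit log."""
    record = {**asdict(run), "decoding": DECODING, "findings": findings}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```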
Confounds I am NOT claiming to control for (stated so readers can weight accordingly):
Privileged authoring context (C1 has it, C2/C3 do not) — this is load-bearing for what C1 means; it is a feature of the condition, not a bug to eliminate. I am measuring the effect of breaking that privilege, not isolating it.
Vendor hidden system prompts — I cannot see Anthropic / Google / OpenAI’s internal pre-prompts; they differ across vendors by design.
Model version drift — I snapshot IDs at preregistration but vendors can change routing behind a named model. I report the exact IDs and re-run if any vendor announces a change before the F1 table ships.
Taxonomy contamination — if any reviewer has seen this exact taxonomy in its training or system prompt, it has an advantage. Taxonomy was not published before authoring began; it is published in this article, which means any reviewer touched by this article post-publication is contaminated. Eval runs complete before this article is indexed.
Grader leakage — graders are blind to condition but not blind to the taxonomy. This is documented and the scoring rubric separately reports category-match vs. mechanism-match to keep the two distinguishable.
Each artifact runs all four conditions. 50 flawed artifacts × 4 conditions = 200 review runs. Plus 10 zero-flaw controls × 4 conditions = 40 abstention runs. Total: 240 review runs.
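Enumerating that run matrix is a few lines; a sketch, with illustrative artifact naming:

```python
from itertools import product

CONDITIONS = ["C1", "C2", "C3a", "C3b"]
flawed = [f"flawed-{i:02d}" for i in range(1, 51)]   # 50 seeded-flaw artifacts
clean = [f"clean-{i:02d}" for i in range(1, 11)]     # 10 zero-flaw controls

run_matrix = list(product(flawed + clean, CONDITIONS))
assert len(run_matrix) == 240   # 200 flawed-artifact runs + 40 abstention runs
```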
Preregistered analysis rule. Before any review run executes, I commit publicly to the following decision rule:
Primary statistic: paired bootstrap of F1 deltas across artifacts (10,000 resamples, 95% CI).
H₀ survives if the 95% CI for each pairwise delta (C2−C1, C3a−C2, C3b−C2, C3a−C1, C3b−C1) contains zero.
H₁ supported if the C2 − C1 CI excludes zero with C2 > C1, AND the C3_max − C2 CI excludes zero with a strictly positive lower bound, where C3_max = max(C3a, C3b).
H₂ supported if H₁ is supported AND the C3_max − C2 lower bound exceeds the C2 − C1 upper bound (i.e., the family effect dominates the session effect non-overlappingly).
Inconclusive if any decisive CI width exceeds ±0.10 on F1; in that case I enlarge the corpus and re-run before claiming a result.
Micro-F1 is primary; macro-F1 and cost-adjusted F1 are reported as secondary. Abstention rate on the zero-flaw controls is reported per condition as a separate table.
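For readers who want to score their own deltas the same way, here is a minimal sketch of the paired bootstrap, assuming per-artifact F1 scores are already computed for two conditions. Function and variable names are mine, not the repo’s.

```python
import random

def paired_bootstrap_ci(f1_a, f1_b, n_resamples=10_000, alpha=0.05, seed=42):
    """95% CI for the mean per-artifact F1 delta (condition A minus condition B).

    f1_a and f1_b are index-aligned lists of per-artifact F1 scores, so each
    resample preserves the pairing (e.g. C2 vs C1 on the same artifact).
    """
    rng = random.Random(seed)
    deltas = [a - b for a, b in zip(f1_a, f1_b)]
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Example decision check for H1: both CIs must exclude zero, with a strictly
# positive lower bound on C3_max - C2 (C3_max = the better of C3a, C3b).
```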
3. Metric
Primary: micro-F1 on flaw detection vs. ground truth. A review “catches” a flaw if it names the flaw category (category-match) OR describes the specific mechanism with enough precision that a human grader marks it as a hit (mechanism-match). Category-match and mechanism-match are reported as separate F1s in addition to the combined score — this keeps taxonomy-pattern-matching distinguishable from independent reasoning. Grading uses two independent graders blind to condition; disagreements go to a third-grader arbitration pass. Inter-rater agreement (Cohen’s κ) is reported alongside the F1 table.
Zero-flaw controls: abstention rate per condition (ideal = 100% no-flaw-reported on the 10 clean artifacts; anything lower is overcalling).
Secondary: precision, recall, cost-adjusted F1, novel-flaw rate (flaws the human graders missed but the LLM identified — these are kept and audited; a portion will likely be added to future ground truth).
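As a sketch of how those numbers roll up once graders have labeled each reported flaw as a hit, miss, or false positive (data shapes and helper names are illustrative):

```python
def micro_f1(reviews):
    """Micro-F1 over all artifacts in one condition.

    reviews: list of dicts, one per artifact, each with
      'hits'   - ground-truth flaws graded as caught (category- OR mechanism-match)
      'missed' - ground-truth flaws not caught
      'extra'  - reported flaws not in ground truth (false positives)
    """
    tp = sum(len(r["hits"]) for r in reviews)
    fn = sum(len(r["missed"]) for r in reviews)
    fp = sum(len(r["extra"]) for r in reviews)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def abstention_rate(clean_reviews):
    """Share of zero-flaw controls where nothing was reported (ideal = 1.0)."""
    return sum(1 for r in clean_reviews if not r["extra"]) / len(clean_reviews)

def cohen_kappa(grader_a, grader_b):
    """Cohen's kappa for two graders' binary hit/miss labels on the same items."""
    n = len(grader_a)
    p_o = sum(a == b for a, b in zip(grader_a, grader_b)) / n
    p_a, p_b = sum(grader_a) / n, sum(grader_b) / n
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)
```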
4. Token tracking
Total tokens per condition, per run. Reported alongside the F1 table.
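That accounting is a small fold over the same per-run log; a sketch, assuming each run record carries the usual prompt/completion token counts:

```python
from collections import defaultdict

def tokens_by_condition(runs):
    """Total tokens per condition from per-run usage records.

    runs: iterable of dicts like
      {"condition": "C2", "prompt_tokens": 812, "completion_tokens": 340}
    """
    totals = defaultdict(int)
    for r in runs:
        totals[r["condition"]] += r["prompt_tokens"] + r["completion_tokens"]
    return dict(totals)
```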
Early signal — the cheap cascade, running today (directional, not the table)
The full 240-run experiment is 2 weeks out. But the cross-family cascade is already running — not as the controlled experiment above, but as a working implementation that anyone can fork. I built it, ran it, and I’m publishing the code + numbers alongside this article so you can redline both.
What’s running. A judge-panel cascade skill (public repo, v3.2.0): two cross-family small-fish judges (Gemini-3.1-Flash-Lite + GPT-5.4-nano) run in parallel as the first pass. If they agree with high confidence (≥80), the verdict stands. If they disagree OR confidence is low, one big-fish cross-family tiebreaker (GPT-5.4) fires. Stdlib Python; no SDK dependency; one command to reproduce.
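Reduced to a sketch, the cascade logic is roughly the following; `call_judge` and the model labels are placeholders for whatever vendor calls you wire in, and the real skill lives in the repo:

```python
AGREE_CONFIDENCE = 80  # both small-fish judges must clear this to skip the tiebreaker

def call_judge(model: str, artifact: str, rubric: str) -> dict:
    """Placeholder: returns {'verdict': 'pass' | 'fail', 'confidence': 0-100}."""
    raise NotImplementedError("wire in your vendor call here")

def cascade(artifact: str, rubric: str) -> dict:
    # First pass: two cross-family small-fish judges (run in parallel in the real skill).
    small_fish = [call_judge(m, artifact, rubric)
                  for m in ("small-fish-gemini", "small-fish-gpt")]
    verdicts = {j["verdict"] for j in small_fish}
    confident = all(j["confidence"] >= AGREE_CONFIDENCE for j in small_fish)
    if len(verdicts) == 1 and confident:
        return {"verdict": verdicts.pop(), "escalated": False}
    # Disagreement or low confidence: the big-fish cross-family tiebreaker fires.
    big = call_judge("big-fish-gpt", artifact, rubric)
    return {"verdict": big["verdict"], "escalated": True}
```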
First-batch numbers. Eight seeded-flaw cases, two rubrics (hallucination and flattery), ground truth committed before any LLM saw the artifacts:
Accuracy: 100% (8/8)
F1 (fail class): 1.000 (P=1.000, R=1.000)
Panel agreement rate (small-fish converge): 75%
Escalation rate (tiebreaker fires): 25%
The two escalations fired exactly where a fake-citation case and an ambiguous-range claim split the small-fish panel — which is the cascade behaving as designed rather than the panel failing. No false positives; no false negatives.
What this is, in terms of the harness above. This is a narrower experiment than C1/C2/C3a/C3b: the small-fish panel runs cross-family, so it tests only the Part 2 thesis, isolated to a single condition (cross-family review vs. nothing). It’s the Part 2 claim retested under controlled conditions with committed ground truth and public code. It is NOT the fresh-session-vs-same-session comparison that the main harness above will ship in 2 weeks.
What this is not. Eight cases is a pilot, not a paper. The corpus was authored by the same person who wrote the rubrics, so some authoring bias is baked in. One run per condition, so no intra-run variance was measured (cf. Rating Roulette, EMNLP 2025: intra-rater variance is first-order). The 50-case experiment above is the one I’ll stand behind as evidence; this is the one I’ll stand behind as “here is running code, here is a first signal, here is the open harness; fork it and try to break the result.”
Why publish directional numbers. Because the asymmetry is the point. The cost of not running cross-family review, as Part 2 already showed, is measured in weeks of downstream rework when a closed-loop same-family review approves drift that has to be undone later. Publishing directional numbers now, with the caveats above, lets readers run the cascade on their own corpus before the full table lands. Being wrong-in-public beats being confirmation-biased-in-private.
The harness for this pilot is in the same public repo as the 50-case harness above. Both are forkable. Both are runnable from one command.
Forward this to one engineer running LLM evals this quarter — the 240-run harness is forkable before the F1 table drops.
What you can do with this before I publish numbers
Three things, in increasing ambition.
(a) Add one cross-context review to your eval suite this week.
In the eval harnesses I’ve looked at across a dozen agentic-systems teams in the last six months, nearly none run a fresh-session pass on their own outputs — the primary model reviews itself in the same session, and that’s the review. If you add one fresh-session pass — same API key, same model, new session, no authoring history — you add an independent signal. Whether that signal is more or differently reliable than self-review is exactly what this harness is testing. Running it in parallel with current self-review loses you nothing even if H₀ turns out to be true. Recent judge-reliability work suggests intra-rater variance is itself first-order (Haldar & Hockenmaier, EMNLP 2025) — so a second independent pass is defensible on priors even before this harness lands.
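A minimal version of that fresh-session pass, with the caveats that the review prompt below is illustrative (the real one ships with the results) and `call_model` stands in for whatever client and key you already use:

```python
REVIEW_PROMPT = (
    "You are reviewing the document below for flaws. You did not write it. "
    "List each flaw with the specific passage and the mechanism of the error. "
    "If you find none, say so explicitly."
)

def fresh_session_review(artifact_text: str, call_model) -> str:
    """One fresh-session pass: same model, new session, no authoring history.

    call_model: any function that takes a list of chat messages and returns text.
    """
    messages = [
        {"role": "system", "content": REVIEW_PROMPT},
        {"role": "user", "content": artifact_text},   # only the artifact is visible
    ]
    return call_model(messages)
```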
(b) Run the harness on your own corpus.
If you build agentic systems, you already have a corpus of 50+ design docs, PR descriptions, architecture memos, or production-readiness reviews. Seed flaws into a held-out 20. Run C1/C2/C3a/C3b. Report F1 internally. The methodology travels.
(c) Bet against me.
If your prior is H₂ (family is the load-bearing variable and fresh-session barely helps), say so publicly. If it’s H₀ (none of this matters), say so. I’ll collect the bets and score them against the F1 table when it ships.
Why the methodology ships first
Every eval post I’ve read in the last six months has the same shape: here are the numbers, here’s the takeaway. That order is backwards when the numbers are the thing being contested.
Methodology published first forces the author to commit to a prediction before the data lands. It lets readers pre-register their own predictions. It rewards being wrong-in-public over being confirmation-biased-in-private.
The Andrew Ng frame — the best AI builders differentiate on eval quality — has a quiet corollary. Eval quality includes eval epistemics. Who ran the eval, in what context, against what prior, with what willingness to publish the unfavorable result.
Numbers without methodology are marketing. Methodology without numbers is scaffolding. I’d rather ship the scaffolding first.
What’s next
Already public (v3.2.0, today): the cross-family cascade skill + pilot harness — github.com/thewhyman/prompt-engineering-in-action. One-command reproduce.
~2 weeks: F1 table + full methodology write-up + the 50-case C1/C2/C3a/C3b harness code on the same public repo.
Week of 2026-05-12 (Part 4): The Eval Taxonomy Production Systems Don’t Have — the 7–10 categories most eval harnesses under-specify, with an implementation sketch per category.
Week of 2026-05-12 (Part 5): The Stop Button Nobody Built — intelligence collapse as an architecture problem, not an alignment one.
→ If you’re running an eval harness on agentic systems and want to compare methodology notes before the F1 table ships, I’d genuinely like to talk.
Prior art — the work this harness stands on
Cross-model and judge-reliability research in the last 12 months has sharpened exactly the questions this harness is built to answer. Readers evaluating my methodology should evaluate these alongside it:
Verga et al., 2024 — Replacing Judges with Juries: Panel-of-LLM-evaluators (arXiv:2404.18796) — multi-model panel evaluation; supportive evidence for panel-level diversity.
Wataoka et al., 2024 — Self-Preference Bias in LLM-as-Judge (arXiv:2410.21819) — same-family reviewers prefer same-family outputs.
Li et al., 2025 / ICLR 2026 — Preference Leakage in LLM-as-Judge (arXiv:2502.01534) — training-distribution leakage across judge-candidate pairs.
Xu et al., ACL 2025 — Does Context Matter? ContextualJudgeBench (paper) — contextual evaluation is itself a hard open problem.
Huang et al., Findings ACL 2025 — An Empirical Study of LLM-as-a-Judge (paper) — judge generalization and fairness.
Shi et al., IJCNLP 2025 — Judging the Judges (paper) — position bias as systematic, not noise.
Haldar & Hockenmaier, EMNLP 2025 — Rating Roulette (paper) — intra-rater variance is a first-class reliability issue; directly load-bearing for the fresh-session hypothesis.
Guerdan et al., NeurIPS 2025 — Rating Indeterminacy in Human and Model Evaluation (paper) — single gold labels systematically mis-validate judges; complicates any claim that one ground-truth grading pass is authoritative.
Wang et al., ICLR 2026 — Evaluating Evaluators: Judge Inconsistency and Transitivity Failures (paper) — LLM judges exhibit non-transitive preference orderings; the harness’s paired-bootstrap design is designed to be robust to exactly this failure mode.
A preregistration snapshot (conditions, model IDs, decoding params, analysis rule, corpus commit hash) will ship with the F1 table in ~2 weeks. If you want the preregistration document before runs begin, subscribe and reply; I’ll send it.
#AIReliability #LLMEvaluation #LLMasJudge #EvalDrivenDevelopment #AIEngineering #ModelDiversity #AIAgents #MCP #FrontierAI #AppliedAI #ResponsibleAI #BuildInPublic

