The diversity effect

What a panel-of-experts scaffold does to RL-trained reasoning, and what happens when the diversity it produces meets RLVR on olympiad math.

STEPHEN CASELLA · APRIL 2026

TL;DR

The same backbone, post-trained two ways. Trained as a panel of experts arguing inside a debate tag, it produces reasoning traces that sit +78.2% further apart in mpnet-space on MATH500 (paired t = 5.91, p < 10⁻⁸) and +75.6% further apart on AIME (t = 4.72, p < 10⁻⁵) than the same model post-trained to think privately. The pass@k gap to Qwen3-thinking narrows monotonically in k across three independent slices.

Update — Apr 25. Wider per-sample search shows up where it matters for RL. On a fresh 877-problem olympiad pool the panel has 382 variance-band problems against thinking's 209: the only regime where group-relative RLVR generates non-zero gradient, and a 1.83× ratio in panel's favor. 100 RL steps on that band carry panel from 14% to 29% on a shared held-out, with per-source gains scaling with training-pool representation. Training adapter: LoRA rank 32 on the frozen base, four to five orders of magnitude under the baseline's post-training compute.

Is the scaffold doing real work?

The default reasoning scaffold is <think>: a long private monologue, often tens of thousands of tokens, before the answer. Every frontier lab ships a thinking model. The format has become invisible the way "let's think step by step" became invisible a few years before it.

But the format is a choice. Train a base model to deliberate as a panel of experts instead, and the comparison the field will ask for first is obvious.

The question

Does the panel scaffold actually reason differently from <think>, or is the difference cosmetic?

Two benchmarks, paired significance tests, and a mechanical check on pass@k say it reasons differently. Each panel sample covers more of the solution surface than each thinking sample. Successive panel samples add more marginal coverage than successive thinking samples. Same coin, two sides: search that's wider per sample and compounds over k. The cost is single-shot accuracy. Every property that depends on sampling more than once moves the other way.

Setup — one backbone, two scaffolds

The result depends on a matched comparison, so the setup deserves more attention than the headline.

The backbone

Qwen3-30B-A3B. A 30B-parameter mixture-of-experts model with roughly 3B active parameters per token. Three releases on the same architecture: a pure Base, a production hybrid that toggles <think> via a chat-template flag, and a dedicated Thinking-2507 variant.

The model under test

The panel adapter starts from Base. Pure RL, no SFT warmup, no distillation. The prompt instructs the model to deliberate inside <mutipersonaDebate>…</mutipersonaDebate> and commit a final answer inside <answer>…</answer>. Reward is outcome correctness plus a small tag-validity term. Two stages: 128 GSM8K gradient steps, then a few dozen on MATH. LoRA rank 32 throughout. That is the entire recipe.

# Panel prompt template (used at both train and eval time)
A conversation between User and Multi-Persona Panel of Experts.
The user asks a question, and the Panel solves it. The Panel first
deliberates and debates the reasoning process with each other and
then provides the user with the answer.

User: {problem}
Assistant: <mutipersonaDebate> … </mutipersonaDebate> <answer> … </answer>
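The reward described in the recipe above (outcome correctness plus a small tag-validity term) could look roughly like the sketch below. The tag names come from the prompt template; the 0.1 bonus, the exact-match grader, and the function names are illustrative assumptions, not the values or code used in the actual runs.

```python
import re
from typing import Optional

# Tag names are taken from the prompt template. Everything else here
# (bonus size, exact-match grading) is an assumption for illustration.
DEBATE_RE = re.compile(r"<mutipersonaDebate>(.*?)</mutipersonaDebate>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
TAG_BONUS = 0.1  # hypothetical size of the "small tag-validity term"


def extract_answer(completion: str) -> Optional[str]:
    """Return the stripped contents of the first <answer> block, if any."""
    match = ANSWER_RE.search(completion)
    return match.group(1).strip() if match else None


def reward(completion: str, reference: str) -> float:
    """Outcome correctness (0 or 1) plus a small bonus for well-formed tags."""
    answer = extract_answer(completion)
    tags_ok = DEBATE_RE.search(completion) is not None and answer is not None
    correct = answer is not None and answer == reference.strip()
    return float(correct) + (TAG_BONUS if tags_ok else 0.0)


sample = "<mutipersonaDebate>A: 2+2. B: agreed.</mutipersonaDebate> <answer> 4 </answer>"
print(reward(sample, "4"))
```

A real grader for math answers would normalize expressions before comparing; exact string match is only the simplest stand-in.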

The baseline

Qwen's production thinking model. Same base architecture, same parameter count, full proprietary post-training stack. We access it through apply_chat_template(enable_thinking=True), which routes the request to the post-trained policy that emits <think>…</think> before the answer. It is the closest peer the architecture allows.

What makes this comparison fair

Same backbone, same parameter count, same temperature (1.0), same benchmarks, same n, same grader. Two axes vary: the reasoning format each model was post-trained to emit, and the amount of post-training behind it.

Panel recipe: ~200 steps (LoRA rank 32 · single node)

Trainable params: ~50 M (< 0.2% of 30B backbone)

Qwen pipeline: SFT + RL (industrial GPU budget · full weights)

Post-train ratio: ~10⁴–10⁵× (baseline vs ours · conservative)

Measuring "different reasoning"

Diversity is one of those words that means too many things to be useful without an operational definition. The one we used: draw N completions from each model on the same problem at temperature 1.0, group them per problem and per model, and look at where they sit in semantic space relative to each other. A scaffold that explores more of the solution surface per sample should produce traces that sit further apart.

01 · extract: Isolate the reasoning. Strip the final answer. Keep the contents of <mutipersonaDebate> (panel) or <think> (thinking). Tag-extraction hit rate: 200/200 on both variants, so no fallback artifacts.

02 · embed: Encode the opening. First 2,000 characters per trace, sentence-transformers/all-mpnet-base-v2. The opening is where strategy is chosen; the rest is elaboration. Stable at an 8,000-char window too.

03 · score: Mean pairwise distance. For each group of N embeddings, compute mean pairwise cosine distance. Average across problems. Paired t-test on problem-level deltas: same problem, panel vs thinking, whose samples are further apart?

04 · sanity: Discrete-mode check. DBSCAN on the embeddings (eps=0.25, cosine). If spread goes up but cluster count is flat, the lift is continuous-variance: the same single strategy explored from more angles.
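The scoring step can be sketched in a few lines of standard-library Python. This assumes the embeddings have already been computed (for example with the mpnet model named in step 02) and arrive as plain lists of floats; the function names are illustrative, not the repo's.

```python
import math
from itertools import combinations
from statistics import mean, stdev


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(embeddings):
    """Average cosine distance over all unordered pairs in one problem's group."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings per group")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


def paired_t(panel_scores, thinking_scores):
    """Paired t statistic on per-problem deltas (panel minus thinking)."""
    deltas = [p - t for p, t in zip(panel_scores, thinking_scores)]
    return mean(deltas) / (stdev(deltas) / math.sqrt(len(deltas)))


# Toy example: three 2-d "embeddings" for one problem.
group = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(mean_pairwise_distance(group))
```

Each problem contributes one score per model, and the t statistic is taken over the per-problem differences, which is what makes the comparison same-problem rather than pooled.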

The pipeline ran on two benchmarks, paired problem-for-problem. Every comparison below is same-problem, same-n.

The headline — wider search per sample

Mean pairwise cosine distance across the N traces a model produces on one problem is a direct proxy for per-sample search volume: how much of the solution surface a single independent sample touches before committing. Higher distance, wider search. Panel's per-sample volume is larger than thinking's on both benchmarks, with nearly identical relative effect sizes.

MATH500 Δ rel: +78.2% (t = 5.91 · p < 10⁻⁸)

AIME Δ rel: +75.6% (t = 4.72 · p < 10⁻⁵)

Per character: ~6× wider (panel uses ~⅛ the tokens)

Pass@k closure: +15.6 pp (k=1 → k=16 on AIME)

Table 01. Per-variant diversity (mean pairwise cosine distance), paired on same problems

Benchmark	Panel	Thinking	Δ abs	Δ rel	t	n (problems)
MATH500 L1–L5	0.095	0.053	+0.042	+78.2%	+5.91	50
AIME 2024+2025	0.119	0.068	+0.051	+75.6%	+4.72	20

The two benchmarks sample different difficulty regimes. MATH500 is near-saturated for thinking (pass@3 = 1.000 on the stratified slice). AIME is not; thinking solves 73% per sample. If the diversity effect were an artifact of both models converging on the same easy answers, the result would shrink on the harder benchmark. It doesn't; the relative effect size is essentially identical.

The lift is not from verbosity

Panel's traces are roughly 1,400 characters on AIME against thinking's 11,300, so thinking writes about 8× as much text per sample and panel gets its diversity lift in about one-eighth the text. Per character, the spread is about 6× wider. The scaffold is not more diverse because it is longer; it is more divergent at every length.

DBSCAN check

Cluster count holds at ~1.0 for both models on both benchmarks. The lift is continuous-variance, not discrete-mode. Both scaffolds commit to one strategy per problem; panel orbits that strategy across a wider arc.

The pass@k signature — search that compounds

An mpnet result on its own is fair to doubt. The next step was to look for a mechanical consequence with no embeddings inside it. If panel really covers more solution surface per sample, each additional panel sample should add more marginal coverage than each additional thinking sample. The two pass@k curves should converge as k grows.

Figure 1. Matched pass@k on AIME 2024+25, both models sampled at n=16 across 20 problems. Unbiased Codex-paper estimator: pass@k = 1 − C(n−c, k) / C(n, k). Panel never crosses thinking in the measured range, but the gap narrows monotonically with k — the signature the diversity hypothesis predicts.
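For reference, the estimator in the caption can be written directly with `math.comb`. This is a minimal sketch; the helper that averages over problems and its argument names are assumptions about how the per-problem counts are stored.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k = 1 - C(n-c, k) / C(n, k), with c correct out of n samples."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average the per-problem estimate over a benchmark slice."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


print(pass_at_k(16, 4, 1))  # 0.25: with k=1 the estimate reduces to c/n
print(pass_at_k(16, 4, 8))
```

A gap at a given k is then the difference between the two models' `mean_pass_at_k` values on the same problems, and closure is the change in that gap from k=1 to the largest k measured.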

The signature replicates on three independent slices, not just AIME.

Table 02. Gap closure across slices: the diversity signature replicates

Slice	k range	Gap at k=1	Gap at k_max	Closure
MATH500 L1–L5 stratified	k=1 → 4	−16.5pp	−10.0pp	+6.5pp
MATH500 L5 full sweep	k=1 → 4	−32.5pp	−17.8pp	+14.7pp
AIME 2024+2025	k=1 → 16	−50.6pp	−35.0pp	+15.6pp

Three slices, three positive closures, all monotone in k. The pass@k mechanism agrees with the embedding measurement, and it does so without knowing anything about embeddings.

Bounds on the claim

A few honest bounds, so the result is not over-claimed on the way out.

What the result is not

Panel's per-sample accuracy is lower than thinking's: about 33 pp lower on MATH500 L5, about 51 pp lower on AIME. In the k range we tested the curves converge but never cross. On AIME, panel solves a strict subset of what thinking solves (11/20 vs 18/20, no panel-only wins). The MATH RL stage also compressed diversity from 0.174 to 0.095; the released checkpoint is an exploration–reliability compromise, not a pure-diversity win.

The reviewer question we'd most like answered next: is the effect the scaffold, or is it our specific RL recipe? Answering it requires a matched-compute <think>-scaffolded LoRA trained with the same reward, same step count, same base. That experiment is still open.

From wider search to faster hill-climbing

The original draft promised a third benchmark as the next step. The follow-up went further than that. Instead of a pass@k re-replication on a new dataset, we asked whether the per-sample diversity converts into gradient signal under RLVR.

Group-relative RLVR (GRPO, RLOO) only produces non-zero advantage when a group of rollouts on the same problem contains both correct and incorrect samples. The variance band (problems with 0 < pass@G < 1, that is, problems where some but not all of the G rollouts are correct) is therefore the only regime that contributes to the policy gradient. Wider per-sample search predicts a wider variance band: more groups that land strictly between the all-wrong floor and the all-right ceiling, more problems with usable advantage, more trainable data.
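The zero-gradient claim is easy to see in code. The sketch below uses a GRPO-style normalization (reward minus group mean, divided by group standard deviation); the `eps` constant and function name are illustrative assumptions, and RLOO would use a leave-one-out baseline instead, with the same consequence for uniform groups.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# All rollouts wrong or all right: every advantage is zero, so no gradient.
print(group_relative_advantages([0, 0, 0, 0]))
print(group_relative_advantages([1, 1, 1, 1]))

# Mixed group: the correct rollout is pushed up, the rest are pushed down.
print(group_relative_advantages([1, 0, 0, 0]))
```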

The pool: 877 olympiad problems from HMMT 2024+25, AIME 2024+25, the OlympiadBench math/English/competition split, and AMC. Each problem was sampled G=8 times per arm at temperature 1.0 and sorted into all_zero (none correct), variance_band (1–7 of 8 correct), or all_one (all 8 correct). Variance band carries gradient. The other two cells are training dead weight under group-relative RLVR.
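The three-way sort described above amounts to the following. This is a minimal sketch: the band names match the post, while the function names and the input format (a list of per-problem correct counts) are assumptions.

```python
from collections import Counter


def classify_band(num_correct: int, group_size: int = 8) -> str:
    """Sort one problem by how many of its G rollouts were graded correct."""
    if not 0 <= num_correct <= group_size:
        raise ValueError("num_correct must be between 0 and group_size")
    if num_correct == 0:
        return "all_zero"
    if num_correct == group_size:
        return "all_one"
    return "variance_band"


def band_counts(correct_per_problem, group_size: int = 8) -> Counter:
    """Count how many problems fall into each band."""
    return Counter(classify_band(c, group_size) for c in correct_per_problem)


# Toy pool of five problems, each sampled 8 times.
print(band_counts([0, 3, 8, 7, 0]))
```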

Variance-band size — diversity on the gradient axis

On the same 877 problems, same G, same temperature:

Panel VB: 382 (43.6% of pool · 0 < pass@8 < 1)

Thinking VB: 209 (23.8% of pool)

VB ratio: 1.83× (panel / thinking)

Token-cost ratio: ~17× (thinking ac_tokens / panel)

Panel sees gradient signal on 382 of 877 problems. Thinking sees it on 209. The 1.83× ratio is the cleanest first-order test of the diversity claim, expressed on the property that controls RL throughput rather than embedding spread. The 17× token-cost ratio is the verbosity story sharpened: panel terminates compactly inside its debate; thinking runs near max_tokens on most problems.

Disjoint frontiers

The two arms' variance bands barely overlap. Cross-tabulating panel × thinking band assignments on the shared pool:

Table 03. Joint band contingency on the 877-problem pool

panel \ thinking	all_zero	variance_band	all_one	row total
all_zero	232	137	108	477
variance_band	30	72	280	382
all_one	0	0	18	18
column total	262	209	406	877

Only 72 problems sit in both bands at once: 19% of panel's, 34% of thinking's. The largest off-diagonal cell on panel's row is the 280 problems where panel still has signal but thinking is saturated. The largest cell on thinking's column is the 137 problems where thinking has signal but panel is below the noise floor. The two scaffolds are working in different parts of problem-space, and panel's part is wider. A joint training pool would have discarded exactly the cells the diversity claim predicted.
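The overlap figures follow directly from the contingency counts. A short check, with the counts copied from Table 03 and variable names chosen for illustration:

```python
BANDS = ["all_zero", "variance_band", "all_one"]

# Rows: panel band. Columns: thinking band. Counts copied from Table 03.
TABLE = [
    [232, 137, 108],
    [30, 72, 280],
    [0, 0, 18],
]

row_totals = [sum(row) for row in TABLE]        # panel marginals
col_totals = [sum(col) for col in zip(*TABLE)]  # thinking marginals

vb = BANDS.index("variance_band")
both = TABLE[vb][vb]
panel_vb = row_totals[vb]
thinking_vb = col_totals[vb]

print(f"panel VB: {panel_vb}, thinking VB: {thinking_vb}")
print(f"overlap as share of panel VB: {100 * both / panel_vb:.0f}%")        # 19%
print(f"overlap as share of thinking VB: {100 * both / thinking_vb:.0f}%")  # 34%
print(f"VB ratio: {panel_vb / thinking_vb:.2f}x")                           # 1.83x
```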

Each arm therefore trains on its own variance band; both score on the same stratified 100-problem held-out (25 per source, drawn from the shared pool, excluded from both training pools). Hyperparameters held constant across arms: LoRA rank 32, lr 5e-6, group size 16, batch 8, temperature 1.0, max_tokens 12,288, 100 steps, eval_every 10.

Panel hill-climbs on its own variance band

Only the panel arm reached step 100. At ~17× the per-rollout token cost, the matched 100-step thinking run would have taken about 37 hours of Tinker compute against panel's ~2.6, and we hit a billing wall about three hours in. The panel trajectory:

Figure 2. Panel held-out pass@1 across 100 RL steps on the panel variance band (354 training problems). Trajectory shape: brief weight-shock dip at batch 10 → monotone climb to batch 50 → step-jump at batch 70 → plateau at 27–29%. Vanilla Qwen3-thinking sits at ~60% on the same held-out (off-chart); panel ends at roughly half of thinking's pre-RL pass rate, but more than doubles its own starting point.

The mechanism — gains scale with training-pool representation

Panel's training pool is unevenly distributed across sources because each source contributes a different fraction of problems to panel's variance band. The held-out gains track that distribution closely:

Table 04. Per-source held-out gains scale with training-pool representation

source	% of panel_train	batch 0	batch 90	Δ
OlympiadBench	88% (311/354)	20%	44%	+24pp
AMC	10% (35/354)	32%	52%	+20pp
AIME	2% (6/354)	4%	16%	+12pp
HMMT	0.6% (2/354)	0%	4%	+4pp
aggregate (n=100)	100%	14%	29%	+15pp

Two readings of the table. First: the absolute gain is largest where the training pool is densest (OlympiadBench, +24 pp on 88% of train). Second: every source moved, including HMMT, which had effectively no direct training data. The pattern is transfer, not memorization. The 4 pp HMMT lift is one problem out of 25, so the absolute number is at the noise floor on its own. The monotone ordering across sources is not.
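Because the held-out set has 25 problems per source, each percentage in Table 04 converts to a whole number of problems, and the aggregate row is the plain sum over sources. A quick check under that assumption (values copied from the table; names are illustrative):

```python
PROBLEMS_PER_SOURCE = 25

# source -> (batch 0 pass rate %, batch 90 pass rate %), copied from Table 04
PER_SOURCE = {
    "OlympiadBench": (20, 44),
    "AMC": (32, 52),
    "AIME": (4, 16),
    "HMMT": (0, 4),
}


def solved(pct: int) -> int:
    """Convert a per-source percentage into a problem count out of 25."""
    return pct * PROBLEMS_PER_SOURCE // 100


start_total = sum(solved(start) for start, _ in PER_SOURCE.values())
end_total = sum(solved(end) for _, end in PER_SOURCE.values())
hmmt_gain = solved(PER_SOURCE["HMMT"][1]) - solved(PER_SOURCE["HMMT"][0])

print(f"aggregate: {start_total}/100 -> {end_total}/100")
print(f"HMMT gain in problems: {hmmt_gain}")
```

One problem is worth 4 pp within a source and 1 pp in the aggregate, which is why the HMMT delta sits at the resolution limit of the table.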

Diversity through RL — entropy collapses, length-spread doesn't

RL is a known diversity-killer. The natural worry was that 100 steps of RLVR would collapse the policy distribution and quietly erase the property the scaffold was chosen for. We tracked two diversity proxies across the run: per-token policy entropy, and trajectory-length spread within a group of 16 rollouts on the same problem.

Token entropy (start): 0.591 (batch 0–9 mean)

Token entropy (end): 0.158 (batch 90–99 mean, −73%)

Length CV (start): 0.36 (ac_len within group, batch 0)

Length CV (end): 0.66 (batch 90, +83%)

Entropy goes the way the textbook predicts. Length spread does not. Panel becomes more confident per token while producing increasingly varied trajectory lengths on the same problem. The signature reads as the three personas keeping their distinct verbosities even as each persona's per-token distribution sharpens.
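The length-spread proxy is a coefficient of variation over rollout lengths within one group. A minimal sketch follows; whether the run used population or sample standard deviation is not stated, so the population form here is an assumption, as are the function names.

```python
from statistics import mean, pstdev


def coefficient_of_variation(lengths):
    """Std / mean of rollout lengths within one group (population std assumed)."""
    return pstdev(lengths) / mean(lengths)


def relative_change(start: float, end: float) -> float:
    """Fractional change from start to end."""
    return (end - start) / start


# Toy group of rollout lengths in tokens.
print(coefficient_of_variation([50, 150]))  # 0.5

# Reported start and end values from the run.
print(f"entropy: {100 * relative_change(0.591, 0.158):.0f}%")   # -73%
print(f"length CV: {100 * relative_change(0.36, 0.66):+.0f}%")  # +83%
```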

Length CV is a cheap proxy, not a semantic one. The full mpnet-cosine measurement on a post-RL checkpoint, the same pipeline that produced the +78% / +76% numbers above, is not in this run. It is in the open list at the bottom of the post. The directional read here is consistent with the diversity story; the strong claim waits for the strong measurement.

What this section does and doesn't say

Honest scope

Does say: the variance-band advantage the diversity claim predicts is real (382 vs 209 on the 877-problem pool). It converts to RL gradient signal: 100 steps double panel's olympiad held-out pass rate (14% → 29%), with per-source gains scaling with training-pool representation. The diversity reservoir survives the run on the proxy we measured — per-token entropy collapses, but length spread within a group grows.
Does not say: the matched <think> RL comparison ran to completion. At ~17× the per-rollout token cost, the matched 100-step thinking run would have taken about 37 hours of Tinker compute, and we hit the billing wall first. The static reference is vanilla Qwen3-thinking on the shared held-out: ~60% pass@1 aggregate, ~55% on OlympiadBench, ~75% on AMC, all at G=8. Panel ends at 29%, about half. The result here is about rate of climb under matched hyperparameters, not absolute capability.
Single seed, single run: the +15 pp number is one seed. The curve's shape replicates across the per-source breakdown; the absolute number wants a second seed before its CI gets reported.

What's next

The Apr 25 update closes the cross-benchmark item from the original list. Variance-band classifications now exist on HMMT, AIME, OlympiadBench, and AMC, with held-out evaluation across all four sources. The remaining items, ordered by how much each would sharpen the claim:

Matched-compute <think> RL arm

Run 100 RL steps on the thinking variance band (180 problems) at the same hyperparameters and put both curves on the same chart. Wall-time estimate: ~37 hours of Tinker compute. The training pool is ready (data/olympiad_pool/thinking_train.jsonl). The run was started and paused at the billing wall. This is the experiment that decides whether the effect is the scaffold or the recipe.

Multi-seed on the panel hill-climbing run

+15 pp on a 100-problem held-out is one seed and a wide CI. A second seed under matched hyperparameters gives the rate-of-climb claim a real number. ~2.6 hours of compute, panel-only.

Held-out diversity post-RL

We measured RL-time entropy (collapsed) and length CV (grew). The semantic measurement on a post-RL checkpoint, the same mpnet-cosine pipeline that produced the +78% and +76% numbers earlier, is the missing piece. If those numbers survive 100 steps of RL, the scaffold's diversity is a stable equilibrium under gradient pressure rather than a starting condition.

Scaling study on n

Does the gap-closure curve keep tightening past the sample budgets we tried? Pushing panel to n=64 or n=128 on the L5 slice gives the asymptote, which decides between "diversity that eventually catches up" and "diversity that plateaus below thinking." Current data points to the latter, but the slope at n=16 is not zero, so the question is still open.

· · ·

The headline holds with the addition. A LoRA rank-32 adapter, a few hundred gradient steps, four to five orders of magnitude under the baseline's post-training compute. The scaffold widens per-sample search by 76–78% on two benchmarks. The widening compounds over k on three independent slices. On a fresh olympiad pool it puts 1.83× more problems in the gradient-signal-bearing band than the production thinking model. 100 RL steps on that band double the held-out pass rate. The mechanism that helped at inference time — more coverage per sample — helps at training time too, because group-relative RLVR only sees variance-band problems, and panel has more of them.

Scaffold choice is a lever, and it costs nothing at inference. The remaining questions narrow to three: matched <think> RL versus panel RL under the same recipe, semantic diversity on a post-RL checkpoint, and the asymptote at larger n. The further-out question — what a full post-training pipeline could do with a panel scaffold instead of against one — sits past the next set of experiments. The new data moves it closer.
