The diversity effect
What a panel-of-experts scaffold does to RL-trained reasoning, and what happens when the diversity it produces meets RLVR on olympiad math.
STEPHEN CASELLA · APRIL 2026
The same backbone, post-trained two ways. Trained as a panel of experts arguing inside a debate tag, it produces reasoning traces that sit +78.2% further apart in mpnet-space on MATH500 (paired t = 5.91, p < 10⁻⁸) and +75.6% further apart on AIME (t = 4.72, p < 10⁻⁵) than the same model post-trained to think privately. The pass@k gap to Qwen3-thinking narrows monotonically in k across three independent slices.
Update — Apr 25. Wider per-sample search shows up where it matters for RL. On a fresh 877-problem olympiad pool the panel has 382 variance-band problems against thinking's 209: the only regime where group-relative RLVR generates non-zero gradient, and a 1.83× ratio in panel's favor. 100 RL steps on that band carry panel from 14% to 29% on a shared held-out, with per-source gains scaling with training-pool representation. Training adapter: LoRA rank 32 on the frozen base, four to five orders of magnitude under the baseline's post-training compute.
Is the scaffold doing real work?
The default reasoning scaffold is <think>: a long private monologue, often tens of thousands of tokens, before the answer. Every frontier lab ships a thinking model. The format has become invisible the way "let's think step by step" became invisible a few years before it.
But the format is a choice. Train a base model to deliberate as a panel of experts instead, and the comparison the field will ask for first is obvious.
Does the panel scaffold actually reason differently from <think>, or is the difference cosmetic?
Two benchmarks, paired significance tests, and a mechanical check on pass@k say it reasons differently. Each panel sample covers more of the solution surface than each thinking sample. Successive panel samples add more marginal coverage than successive thinking samples. Same coin, two sides: search that's wider per sample and compounds over k. The cost is single-shot accuracy. Every property that depends on sampling more than once moves the other way.
Setup — one backbone, two scaffolds
The result depends on a matched comparison, so the setup deserves more attention than the headline.
The backbone
Qwen3-30B-A3B. A 30B-parameter mixture-of-experts model with roughly 3B active parameters per token. Three releases on the same architecture: a pure Base, a production hybrid that toggles <think> via a chat-template flag, and a dedicated Thinking-2507 variant.
The model under test
The panel adapter starts from Base. Pure RL, no SFT warmup, no distillation. The prompt instructs the model to deliberate inside <mutipersonaDebate>…</mutipersonaDebate> and commit a final answer inside <answer>…</answer>. Reward is outcome correctness plus a small tag-validity term. Two stages: 128 GSM8K gradient steps, then a few dozen on MATH. LoRA rank 32 throughout. That is the entire recipe.
    # Panel prompt template (used at both train and eval time)
    A conversation between User and Multi-Persona Panel of Experts.
    The user asks a question, and the Panel solves it. The Panel first
    deliberates and debates the reasoning process with each other and
    then provides the user with the answer.

    User: {problem}
    Assistant: <mutipersonaDebate> … </mutipersonaDebate> <answer> … </answer>
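The reward described above (outcome correctness plus a small tag-validity term) might look like the following sketch. The function name, the bonus value, and exact-string answer matching are illustrative assumptions, not the released recipe:

```python
import re

def panel_reward(completion: str, gold: str, tag_bonus: float = 0.1) -> float:
    """Hypothetical reward: 1.0 for a correct <answer>, plus a small
    bonus if the full debate/answer tag structure is well-formed."""
    pattern = r"<mutipersonaDebate>.*?</mutipersonaDebate>\s*<answer>.*?</answer>"
    valid = re.search(pattern, completion, re.DOTALL) is not None
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    correct = m is not None and m.group(1).strip() == gold.strip()
    return float(correct) + (tag_bonus if valid else 0.0)
```

In a real verifier the answer check would be a math-aware grader rather than string equality; the structure of the signal (outcome term plus format term) is what matters here.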
The baseline
Qwen's production thinking model. Same base architecture, same parameter count, full proprietary post-training stack. We access it through apply_chat_template(enable_thinking=True), which routes the request to the post-trained policy that emits <think>…</think> before the answer. It is the closest peer the architecture allows.
Same backbone, same parameter count, same temperature (1.0), same benchmarks, same n, same grader. Two axes vary: the reasoning format each model was post-trained to emit, and the amount of post-training behind it.
Measuring "different reasoning"
Diversity is one of those words that means too many things to be useful without an operational definition. The one we used: draw N completions from each model on the same problem at temperature 1.0, group them per problem and per model, and look at where they sit in semantic space relative to each other. A scaffold that explores more of the solution surface per sample should produce traces that sit further apart.
- Trace extraction: deliberation text comes from <mutipersonaDebate> (panel) or <think> (thinking). Tag-extraction hit rate: 200/200 on both variants — no fallback artifacts.
- Embedding: sentence-transformers/all-mpnet-base-v2, applied to the opening of each trace. The opening is where strategy is chosen; the rest is elaboration. Results are stable at an 8,000-char window too.
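Tag extraction reduces to a regex over each completion. A minimal sketch, assuming well-formed tags (which the 200/200 hit rate supports):

```python
import re

def extract_trace(completion: str, tag: str):
    """Pull the deliberation text out of <tag>...</tag>; None if absent."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", completion, re.DOTALL)
    return m.group(1).strip() if m else None

# extract_trace(sample, "mutipersonaDebate") for panel,
# extract_trace(sample, "think") for thinking.
```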
The pipeline ran on two benchmarks, paired problem-for-problem. Every comparison below is same-problem, same-n.
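The per-problem statistic is mean pairwise cosine distance over the N trace embeddings. A plain-NumPy sketch (the embeddings are assumed to come from the mpnet model above):

```python
import numpy as np

def mean_pairwise_cosine_distance(embeddings: np.ndarray) -> float:
    """Mean of 1 - cos(e_i, e_j) over all unordered pairs of N embeddings."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T                       # cosine similarity matrix
    i, j = np.triu_indices(len(X), k=1)  # unordered pairs, i < j
    return float(np.mean(1.0 - sims[i, j]))

# Identical traces score 0.0; mutually orthogonal traces score 1.0.
```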
The headline — wider search per sample
Mean pairwise cosine distance across the N traces a model produces on one problem is a direct proxy for per-sample search volume: how much of the solution surface a single independent sample touches before committing. Higher distance, wider search. Panel's per-sample volume is larger than thinking's on both benchmarks, with nearly identical relative effect sizes.
| Benchmark | Panel | Thinking | Δ abs | Δ rel | t | n |
|---|---|---|---|---|---|---|
| MATH500 L1–L5 | 0.095 | 0.053 | +0.042 | +78.2% | +5.91 | 50 |
| AIME 2024+2025 | 0.119 | 0.068 | +0.051 | +75.6% | +4.72 | 20 |
The two benchmarks sample different difficulty regimes. MATH500 is near-saturated for thinking (pass@3 = 1.000 on the stratified slice). AIME is not; thinking solves 73% per sample. If the diversity effect were an artifact of both models converging on the same easy answers, the result would shrink on the harder benchmark. It doesn't; the relative effect size is essentially identical.
The lift is not from verbosity
Panel's traces are roughly 1,400 characters on AIME against thinking's 11,300. Thinking writes about 8× as much text per sample. Panel achieves its diversity lift in one-eighth the token budget. Per character, the spread is about 6× wider. The scaffold is not more diverse because it is longer; it is more divergent at every length.
Cluster count holds at ~1.0 for both models on both benchmarks. The lift is continuous-variance, not discrete-mode. Both scaffolds commit to one strategy per problem; panel orbits that strategy across a wider arc.
The pass@k signature — search that compounds
An mpnet result on its own is fair to doubt. The next step was to look for a mechanical consequence with no embeddings inside it. If panel really covers more solution surface per sample, each additional panel sample should add more marginal coverage than each additional thinking sample. The two pass@k curves should converge as k grows.
The signature replicates on three independent slices, not just AIME.
| Slice | k range | Gap at k=1 | Gap at k_max | Closure |
|---|---|---|---|---|
| MATH500 L1–L5 stratified | k=1 → 4 | −16.5pp | −10.0pp | +6.5pp |
| MATH500 L5 full sweep | k=1 → 4 | −32.5pp | −17.8pp | +14.7pp |
| AIME 2024+2025 | k=1 → 16 | −50.6pp | −35.0pp | +15.6pp |
Three slices, three positive closures, all monotone in k. The pass@k mechanism agrees with the embedding measurement, and it does so without knowing anything about embeddings.
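For reference, pass@k from n graded samples is typically computed with the unbiased estimator below; the post does not name its estimator, so this is an assumption about the pipeline rather than a statement of it:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one correct among k draws without replacement
    from n samples of which c are correct)."""
    if n - c < k:
        return 1.0  # fewer wrong samples than draws: a correct one is guaranteed
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# pass@1 is just per-sample accuracy: pass_at_k(16, 8, 1) == 0.5
```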
Bounds on the claim
A few honest bounds, so the result is not over-claimed on the way out.
Per-sample accuracy on the panel is lower than on thinking: about 33 pp lower on MATH500 L5, about 51 pp lower on AIME. In the k range we tested the curves converge but never cross. On AIME, panel solves a strict subset of what thinking solves (11/20 vs 18/20, no panel-only wins). The MATH RL stage also compressed diversity from 0.174 to 0.095; the released checkpoint is an exploration–reliability compromise, not a pure-diversity win.
The reviewer question we'd most like answered next: is the effect the scaffold, or is it our specific RL recipe? Answering it requires a matched-compute <think>-scaffolded LoRA trained with the same reward, same step count, same base. That experiment is still open.
From wider search to faster hill-climbing
The original draft promised a third benchmark as the next step. The follow-up went further than that. Instead of a pass@k re-replication on a new dataset, we asked whether the per-sample diversity converts into gradient signal under RLVR.
Group-relative RLVR (GRPO, RLOO) only produces non-zero advantage when a group of rollouts on the same problem contains both correct and incorrect samples. The variance band — problems with 0 < pass@G < 1 — is therefore the only regime that contributes to the policy gradient. Wider per-sample search predicts a wider variance band: more rollouts that fall outside both ceilings, more problems with usable advantage, more trainable data.
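Why the degenerate cells carry nothing can be shown in three lines. With a group-relative baseline, each rollout's advantage is its reward minus the group mean (a GRPO-style sketch; std-normalization and clipping omitted):

```python
import numpy as np

def group_relative_advantages(rewards) -> np.ndarray:
    """GRPO-style advantages: each rollout's reward minus the group mean.
    All-correct or all-wrong groups yield zero advantage everywhere."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

# group_relative_advantages([1]*8)  -> all zeros (saturated, no gradient)
# group_relative_advantages([0]*8)  -> all zeros (hopeless, no gradient)
# group_relative_advantages([1,1,0,0,0,0,0,0])  -> non-zero (variance band)
```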
The pool: 877 olympiad problems from HMMT 2024+25, AIME 2024+25, the OlympiadBench math/English/competition split, and AMC. Each problem was sampled G=8 times per arm at temperature 1.0 and sorted into all_zero (none correct), variance_band (1–7 of 8 correct), or all_one (all 8 correct). Variance band carries gradient. The other two cells are training dead weight under group-relative RLVR.
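The sorting step can be sketched as follows (a hypothetical helper, not the authors' code):

```python
def classify_band(num_correct: int, G: int = 8) -> str:
    """Sort a problem into the three RLVR cells by its G rollout outcomes."""
    if num_correct == 0:
        return "all_zero"       # no correct rollout: zero advantage
    if num_correct == G:
        return "all_one"        # saturated: zero advantage
    return "variance_band"      # mixed outcomes: carries gradient

# Example pool: {problem_id: number of correct rollouts out of G}
pool = {"p1": 0, "p2": 3, "p3": 8}
bands = {pid: classify_band(c) for pid, c in pool.items()}
```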
Variance-band size — diversity on the gradient axis
On the same 877 problems, same G, same temperature:
Panel sees gradient signal on 382 of 877 problems; thinking sees it on 209. The 1.83× ratio is the cleanest first-order test of the diversity claim, expressed on the property that controls RL throughput rather than embedding spread. The accompanying ~17× per-rollout token-cost ratio (thinking over panel) is the verbosity story sharpened: panel terminates compactly inside its debate; thinking runs near max_tokens on most problems.
Disjoint frontiers
The two arms' variance bands barely overlap. Cross-tabulating panel × thinking band assignments on the shared pool:
| panel \ thinking | all_zero | variance_band | all_one | row total |
|---|---|---|---|---|
| all_zero | 232 | 137 | 108 | 477 |
| variance_band | 30 | 72 | 280 | 382 |
| all_one | 0 | 0 | 18 | 18 |
| column total | 262 | 209 | 406 | 877 |
Only 72 problems sit in both bands at once: 19% of panel's, 34% of thinking's. The largest off-diagonal cell on panel's row is the 280 problems where panel still has signal but thinking is saturated. The largest cell on thinking's column is the 137 problems where thinking has signal but panel is below the noise floor. The two scaffolds are working in different parts of problem-space, and panel's part is wider. A joint training pool would have discarded exactly the cells the diversity claim predicted.
Each arm therefore trains on its own variance band; both score on the same stratified 100-problem held-out (25 per source, drawn from the shared pool, excluded from both training pools). Hyperparameters held constant across arms: LoRA rank 32, lr 5e-6, group size 16, batch 8, temperature 1.0, max_tokens 12,288, 100 steps, eval_every 10.
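Collected as a config fragment for reference (key names are illustrative, not a real framework's schema):

```python
# Shared RL hyperparameters for both arms; values from the text above.
rl_config = {
    "adapter": "lora",
    "lora_rank": 32,
    "learning_rate": 5e-6,
    "group_size": 16,      # rollouts per problem (the G for advantage groups)
    "batch_size": 8,       # problems per gradient step
    "temperature": 1.0,
    "max_tokens": 12288,
    "steps": 100,
    "eval_every": 10,
}
```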
Panel hill-climbs on its own variance band
Only the panel arm reached step 100. At ~17× the per-rollout token cost, the matched 100-step thinking run would have taken about 37 hours of Tinker compute against panel's ~2.6, and we hit a billing wall about three hours in. The panel trajectory climbs steadily across the run, from 14% to 29% on the shared held-out.
The mechanism — gains scale with training-pool representation
Panel's training pool is unevenly distributed across sources because each source contributes a different fraction of problems to panel's variance band. The held-out gains track that distribution closely:
| source | % of panel_train | batch 0 | batch 90 | Δ |
|---|---|---|---|---|
| OlympiadBench | 88% (311/354) | 20% | 44% | +24pp |
| AMC | 10% (35/354) | 32% | 52% | +20pp |
| AIME | 2% (6/354) | 4% | 16% | +12pp |
| HMMT | 0.6% (2/354) | 0% | 4% | +4pp |
| aggregate (n=100) | 100% | 14% | 29% | +15pp |
Two readings of the table. First: the absolute gain is largest where the training pool is densest (OlympiadBench, +24 pp on 88% of train). Second: every source moved, including HMMT, which had effectively no direct training data. The pattern is transfer, not memorization. The 4 pp HMMT lift is one problem out of 25, so the absolute number is at the noise floor on its own. The monotone ordering across sources is not.
Diversity through RL — entropy collapses, length-spread doesn't
RL is a known diversity-killer. The natural worry was that 100 steps of RLVR would collapse the policy distribution and quietly erase the property the scaffold was chosen for. We tracked two diversity proxies across the run: per-token policy entropy, and trajectory-length spread within a group of 16 rollouts on the same problem.
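Both proxies are cheap to compute. A sketch of the assumed forms (not the authors' exact implementation):

```python
import numpy as np

def length_cv(lengths) -> float:
    """Coefficient of variation of trajectory lengths within one
    group of rollouts on the same problem: std / mean."""
    x = np.asarray(lengths, dtype=float)
    return float(x.std() / x.mean())

def token_entropy(probs) -> float:
    """Shannon entropy (nats) of a single per-token distribution;
    the run-level proxy averages this over tokens and rollouts."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())
```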
Entropy goes the way the textbook predicts. Length spread does not. Panel becomes more confident per token while producing increasingly varied trajectory lengths on the same problem. The signature reads as the three personas keeping their distinct verbosities even as each persona's per-token distribution sharpens.
Length CV is a cheap proxy, not a semantic one. The full mpnet-cosine measurement on a post-RL checkpoint, the same pipeline that produced the +78% / +76% numbers above, is not in this run. It is in the open list at the bottom of the post. The directional read here is consistent with the diversity story; the strong claim waits for the strong measurement.
What this section does and doesn't say
- Does say: the variance-band advantage the diversity claim predicts is real (382 vs 209 on the 877-problem pool). It converts to RL gradient signal: 100 steps double panel's olympiad held-out pass rate (14% → 29%), with per-source gains scaling with training-pool representation. The diversity reservoir survives the run on the proxy we measured — per-token entropy collapses, but length spread within a group grows.
- Does not say: the matched <think> RL comparison ran to completion. At ~17× the per-rollout token cost, the matched 100-step thinking run would have taken about 37 hours of Tinker compute, and we hit the billing wall first. The static reference is vanilla Qwen3-thinking on the shared held-out: ~60% pass@1 aggregate, ~55% on OlympiadBench, ~75% on AMC, all at G=8. Panel ends at 29%, about half. The result here is about rate of climb under matched hyperparameters, not absolute capability.
- Single seed, single run: the +15 pp number is one seed. The curve's shape replicates across the per-source breakdown; the absolute number wants a second seed before its CI gets reported.
What's next
The Apr 25 update closes the cross-benchmark item from the original list. Variance-band classifications now exist on HMMT, AIME, OlympiadBench, and AMC, with held-out evaluation across all four sources. The remaining items, ordered by how much each would sharpen the claim:
Matched-compute <think> RL arm
Run a 100-step RL on the thinking variance band (180 problems, after excluding the shared held-out) at the same hyperparameters and put both curves on the same chart. Wall-time estimate: ~37 hours of Tinker compute. The training pool is ready (data/olympiad_pool/thinking_train.jsonl). The run was started and paused at the billing wall. This is the experiment that decides whether the effect is the scaffold or the recipe.
Multi-seed on the panel hill-climbing run
+15 pp on a 100-problem held-out is one seed and a wide CI. A second seed under matched hyperparameters gives the rate-of-climb claim a real number. ~2.6 hours of compute, panel-only.
Held-out diversity post-RL
We measured RL-time entropy (collapsed) and length CV (grew). The semantic measurement on a post-RL checkpoint, the same mpnet-cosine pipeline that produced the +78% and +76% numbers earlier, is the missing piece. If those numbers survive 100 steps of RL, the scaffold's diversity is a stable equilibrium under gradient pressure rather than a starting condition.
Scaling study on n
Does the gap-closure curve keep tightening past the sample budgets we tried? Pushing panel to n=64 or n=128 on the L5 slice gives the asymptote, which decides between "diversity that eventually catches up" and "diversity that plateaus below thinking." Current data leans toward the latter, but the slope at n=16 is not yet zero, so the asymptote is still open.
The headline holds with the addition. A LoRA rank-32 adapter, a few hundred gradient steps, four to five orders of magnitude under the baseline's post-training compute. The scaffold widens per-sample search by 76–78% on two benchmarks. The widening compounds over k on three independent slices. On a fresh olympiad pool it puts 1.83× more problems in the gradient-signal-bearing band than the production thinking model. 100 RL steps on that band double the held-out pass rate. The mechanism that helped at inference time — more coverage per sample — helps at training time too, because group-relative RLVR only sees variance-band problems, and panel has more of them.
Scaffold choice is a lever, and it costs nothing at inference. The remaining questions narrow to three: matched <think> RL versus panel RL under the same recipe, semantic diversity on a post-RL checkpoint, and the asymptote at larger n. The further-out question — what a full post-training pipeline could do with a panel scaffold instead of against one — sits past the next set of experiments. The new data moves it closer.