April 2026

The diversity effect

A panel-of-experts scaffold changes RL-trained reasoning by widening per-sample search, then turns that diversity into RLVR signal on olympiad math.

TL;DR

I post-trained the same backbone two ways. Trained as a panel of experts arguing inside a debate tag, it produces reasoning traces that sit +78.2% further apart in mpnet-space on MATH500 (paired t = 5.91, p < 10⁻⁸) and +75.6% further apart on AIME (t = 4.72, p < 10⁻⁵) than the same model post-trained to think privately. The pass@k gap to Qwen3-thinking narrows monotonically in k across three independent slices.

Deployment cost. On the 374 paired draws across MATH500 L5 and AIME 24+25 where both arms produced a correct answer, thinking uses 5.9-6.9× more tokens (median) than the panel, and it runs longer than the panel in 99.6-99.7% of all paired draws regardless of correctness. At the cost-per-correct-answer level, total tokens divided by correct samples, thinking is 2.5-5× more expensive than the panel. Per-sample accuracy still favors thinking; per-token efficiency reverses the comparison. (Wilcoxon signed-rank: p = 6×10⁻⁵² on MATH, p = 8×10⁻¹³ on AIME.)

Apr 25 update. Wider per-sample search shows up where it matters for RL. On a fresh 877-problem olympiad pool, the panel has 382 variance-band problems against thinking's 209, a 1.83× ratio in panel's favor. The variance band is the only regime where group-relative RLVR generates non-zero gradient. 100 RL steps on that band carry panel from 14% to 29% on a shared held-out set. That is about half of vanilla thinking's pre-RL ~60% on the same problems, but more than double panel's own starting point. Training adapter: LoRA rank 32 on the frozen base, four to five orders of magnitude under the baseline's post-training compute. A matched-compute thinking LoRA would distinguish scaffold from RL recipe. That experiment has not yet been run.

Scaffold work, measured

The default reasoning scaffold is <think>: a long private monologue, often tens of thousands of tokens, before the answer. Every frontier lab ships a thinking model. The format has become invisible the way "let's think step by step" became invisible a few years before it.

But the format is a choice. Train a base model to deliberate as a panel of experts instead, and the first comparison the field will ask for is obvious: panel against <think>, on the same backbone.

The question

For the scaffold to matter, the panel must reason differently from <think>, not merely format its traces differently. The tests below measure that difference.

Two benchmarks, paired significance tests, and a mechanical check on pass@k (the chance that at least one of k sampled attempts is correct) say it reasons differently. Each panel sample covers more of the solution surface than each thinking sample. Successive panel samples add more marginal coverage than successive thinking samples. Same coin, two sides: search that's wider per sample and compounds over k. The cost is single-shot accuracy. Every property that depends on sampling more than once moves the other way.

Setup: one backbone, two scaffolds

The setup controls the comparison: same backbone, same evals, two reasoning formats.

The backbone

Qwen3-30B-A3B. A 30B-parameter mixture-of-experts model with roughly 3B active parameters per token. Three releases on the same architecture: a pure Base, a production hybrid that toggles <think> via a chat-template flag, and a dedicated Thinking-2507 variant.

The model under test

The panel adapter starts from Base. Pure RL, no SFT warmup, no distillation. The prompt instructs the model to deliberate inside <mutipersonaDebate>…</mutipersonaDebate> and commit a final answer inside <answer>…</answer>. Reward is outcome correctness plus a small tag-validity term. Two stages: 128 GSM8K gradient steps, then a few dozen on MATH. LoRA rank 32 throughout. That is the entire recipe.

# Panel prompt template (used at both train and eval time)
A conversation between User and Multi-Persona Panel of Experts.
The user asks a question, and the Panel solves it. The Panel first
deliberates and debates the reasoning process with each other and
then provides the user with the answer.

User: {problem}
Assistant: <mutipersonaDebate> … </mutipersonaDebate> <answer> … </answer>

The baseline

Qwen's production thinking model. Same base architecture, same parameter count, full proprietary post-training stack. I access it through apply_chat_template(enable_thinking=True), which routes the request to the post-trained policy that emits <think>…</think> before the answer. It is the closest peer the architecture allows.

Matched axes

Same backbone, same parameter count, same temperature (1.0), same benchmarks, same n, same grader. Two axes vary: the reasoning format each model was post-trained to emit, and the amount of post-training behind it.

Panel recipe: ~200 steps (LoRA rank 32 · single node)
Trainable params: ~50 M (< 0.2% of 30B backbone)
Qwen pipeline: SFT + RL (industrial GPU budget · full weights)
Post-train ratio: ~10⁴–10⁵× (baseline vs panel adapter · conservative)

Measuring "different reasoning"

I defined diversity as per-problem semantic spread. Draw N completions from each model on the same problem at temperature 1.0, group them per problem and per model, and measure where they sit in semantic space relative to each other. A scaffold that explores more of the solution surface per sample should produce traces that sit further apart.

01 · extract: Isolate the reasoning. Strip the final answer. Keep the contents of <mutipersonaDebate> (panel) or <think> (thinking). Tag-extraction hit rate: 200/200 on both variants, with no fallback artifacts.
02 · embed: Encode the opening. First 2,000 characters per trace, sentence-transformers/all-mpnet-base-v2. The opening is where strategy is chosen; the rest is elaboration. Stable at an 8,000-char window too.
03 · score: Mean pairwise distance. For each group of N embeddings, compute mean pairwise cosine distance. Average across problems. Paired t-test on problem-level deltas: same problem, panel vs thinking, whose samples are further apart?
04 · sanity: Discrete-mode check. DBSCAN on the embeddings (eps=0.25, cosine). If spread goes up but cluster count is flat, the lift is continuous-variance: one strategy explored from more angles.
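The scoring step (03) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline's actual code: embeddings are plain lists of floats (in practice they would come from the mpnet encoder), and the function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, stdev


def cosine_distance(u, v):
    """1 - cosine similarity. Assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(embeddings):
    """Average cosine distance over all unordered pairs in one group
    (one problem, one model, N sampled traces)."""
    return mean(cosine_distance(u, v) for u, v in combinations(embeddings, 2))


def paired_t(panel_scores, thinking_scores):
    """Paired t statistic on per-problem deltas (panel - thinking).
    Both lists must be aligned problem-for-problem."""
    deltas = [p - t for p, t in zip(panel_scores, thinking_scores)]
    return mean(deltas) / (stdev(deltas) / math.sqrt(len(deltas)))
```

Each problem contributes one `mean_pairwise_distance` value per model; `paired_t` then compares the two aligned lists, which is what makes the test same-problem by construction.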

The pipeline ran on two benchmarks, paired problem-for-problem. Every comparison below is same-problem, same-n.

Wider search per sample

Mean pairwise cosine distance across the N traces a model produces on one problem proxies per-sample search volume: how much of the solution surface a single independent sample touches before committing. Higher distance means wider search. Panel's per-sample volume is larger than thinking's on both benchmarks, with matching relative effect sizes.

MATH500 Δ rel: +78.2% (t = 5.91 · p < 10⁻⁸)
AIME Δ rel: +75.6% (t = 4.72 · p < 10⁻⁵)
Per character: ~6× wider (panel uses ~⅛ the tokens)
Pass@k closure: +15.6 pp (k=1 → k=16 on AIME)
Table 01. Per-variant diversity, paired on same problems

| Benchmark | Panel | Thinking | Δ abs | Δ rel | t | n |
|---|---|---|---|---|---|---|
| MATH500 L1–L5 | 0.095 | 0.053 | +0.042 | +78.2% | +5.91 | 50 |
| AIME 2024+2025 | 0.119 | 0.068 | +0.051 | +75.6% | +4.72 | 20 |

The two benchmarks sample different difficulty regimes. MATH500 is near-saturated for thinking (pass@3 = 1.000 on the stratified slice). AIME is harder; thinking solves 73% per sample. If both models were converging on the same easy answers, the diversity lift would shrink on AIME. It stays at the same scale.

Token length fails to explain the lift

Panel's traces are roughly 1,400 characters on AIME against thinking's 11,300. Thinking writes about 8× as much text per sample. Panel achieves its diversity lift in one-eighth the token budget. Per character, the spread is about 6× wider. The scaffold is more divergent at every length.

DBSCAN check

Cluster count holds at ~1.0 for both models on both benchmarks. The lift is continuous-variance, not discrete-mode. Both scaffolds commit to one strategy per problem; panel orbits that strategy across a wider arc.

The pass@k signature

An mpnet result on its own is fair to doubt. I looked for a mechanical consequence with no embeddings inside it. If panel covers more solution surface per sample, each additional panel sample should add more marginal coverage than each additional thinking sample. The two pass@k curves should converge as k grows.

[Figure: Pass@k on AIME 2024+25 (n=16, 20 problems), panel adapter vs Qwen3 thinking. Panel trails at every k, but the gap narrows monotonically. Panel pass@k: .225 (k=1), .402 (k=4), .483 (k=8), .550 (k=16). Gap to thinking: −50.6 pp, −44.6 pp, −39.0 pp, −35.0 pp. Closure: −50.6 pp → −35.0 pp = +15.6 pp over 16× sampling.]
Figure 1. Matched pass@k on AIME 2024+25, both models sampled at n=16 across 20 problems. Unbiased Codex-paper estimator: pass@k = 1 − C(n−c, k) / C(n, k). Panel never crosses thinking in the measured range, but the gap narrows monotonically with k, matching the diversity hypothesis.
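The estimator in the caption is short enough to write out. A minimal sketch, assuming per-problem correct counts c out of n samples are already available; the function names are illustrative, not taken from the eval code.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: 1 - C(n-c, k) / C(n, k).
    n = samples drawn, c = samples that were correct, k <= n."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0
    # whenever fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Benchmark-level pass@k: average the per-problem estimates."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

A problem with 4 correct samples out of 16 gives `pass_at_k(16, 4, 1) == 0.25`; at k=16 the estimate is 1.0 for any problem with at least one correct sample, which is why the k=16 point equals the fraction of problems solved at least once.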

The signature replicates on three independent slices, including AIME.

Table 02. Gap closure across slices, the diversity signature replicates

| Slice | k range | Gap at k=1 | Gap at k_max | Closure |
|---|---|---|---|---|
| MATH500 L1–L5 stratified | k=1 → 4 | −16.5pp | −10.0pp | +6.5pp |
| MATH500 L5 full sweep | k=1 → 4 | −32.5pp | −17.8pp | +14.7pp |
| AIME 2024+2025 | k=1 → 16 | −50.6pp | −35.0pp | +15.6pp |

Three slices, three positive closures, all monotone in k. The pass@k mechanism agrees with the embedding measurement, and it does so without knowing anything about embeddings.

Token efficiency for deployment

The pass@k story says: at fixed sample count, panel narrows the gap as k grows. Deployment adds a cost question: at fixed correctness, which scaffold uses fewer tokens? Cosine distances and pass@k curves are mechanism-level results. Tokens-per-correct-answer shows up on the bill.

I re-joined the existing eval rollouts pairwise on (problem, sample-index), using the same prompts, same temperature, and same max-token budgets, then read the completion_tokens field that sampling wrote out. No re-tokenization, no embeddings, no judgment calls. The comparison is paired by construction.

MATH500 L5 median ratio: 6.91× (thinking ÷ panel · n = 306 paired both-correct)
AIME median ratio: 5.89× (thinking ÷ panel · n = 68 paired both-correct)
MATH cost / correct: 5.07× (thinking is more expensive per correct answer)
AIME cost / correct: 2.48× (thinking is more expensive per correct answer)
Table 03. Token-cost asymmetry, paired on (problem, sample-index)

| Bucket | n | Panel tokens (median) | Thinking tokens (median) | Median ratio | Thinking longer in (draws) |
|---|---|---|---|---|---|
| MATH500 L5, both correct | 306 | 652 | 4,958 | +6.91× | 306 / 306 |
| MATH500 L5, panel-only correct | 12 | 825 | 10,013 | +14.17× | 12 / 12 |
| MATH500 L5, all paired | 536 | 786 | 6,634 | +7.65× | 534 / 536 |
| AIME 24+25, both correct | 68 | 1,034 | 5,510 | +5.89× | 67 / 68 |
| AIME 24+25, panel-only correct | 4 | 1,116 | 16,384 | +14.95× | 4 / 4 |
| AIME 24+25, all paired | 320 | 1,069 | 11,977 | +9.85× | 319 / 320 |

Wilcoxon signed-rank on the both-correct subsets, pairing every (problem, sample-index) draw on the (think − panel) token-count diff: p = 6.4×10⁻⁵² on MATH500 (n=306), p = 8.0×10⁻¹³ on AIME (n=68). The directionality is so strong that the more useful summary is the count, not the p-value: thinking is longer in 99.6% of MATH paired draws and 99.7% of AIME paired draws, regardless of who's right.
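The two summaries quoted above, the median of per-draw ratios and the count of draws where thinking is longer, can be sketched as follows. This is illustrative only and is not taken from scripts/analyze_token_efficiency.py; the input shape (a list of paired token counts) and the function name are assumptions. The signed-rank test itself is available in SciPy as `scipy.stats.wilcoxon`, applied to the per-draw differences.

```python
from statistics import median


def paired_token_summary(pairs):
    """pairs: list of (panel_tokens, thinking_tokens), one entry per
    (problem, sample-index) draw present in both arms."""
    ratios = [thinking / panel for panel, thinking in pairs]
    longer = sum(1 for panel, thinking in pairs if thinking > panel)
    return {
        # Median of per-draw ratios, not the ratio of the two medians.
        "median_ratio": median(ratios),
        "thinking_longer": longer,
        "n": len(pairs),
    }
```

The median of per-draw ratios differs from the ratio of medians, which is why 4,958 ÷ 652 in Table 03 does not equal the reported 6.91×.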

Reframe by deployment cost. Total tokens spent in a benchmark run, divided by the number of correct answers it produced, gives a cost-per-correct figure that absorbs both the per-sample length and the hit rate into a single number:

Table 04. Cost per correct answer (total tokens ÷ correct samples)

| Benchmark | Panel (tok / correct) | Thinking (tok / correct) | Δ |
|---|---|---|---|
| MATH500 L5 (n = 536 paired draws) | 1,652 | 8,376 | thinking 5.07× more expensive |
| AIME 2024+2025 (n = 320 paired draws) | 6,257 | 15,491 | thinking 2.48× more expensive |

The two findings sit in tension only if the metric is per-sample accuracy. Thinking has higher per-sample accuracy on both benchmarks (90.9% vs 59.3% on MATH500 L5; 73.1% vs 22.5% on AIME). It also pays 2.5–5× more compute for each correct answer it produces. Per-sample accuracy and cost-per-correct disagree because thinking's accuracy lead (about 1.5× on MATH500 L5, about 3.2× on AIME) is smaller than its token overhead per sample. Once compute is in the denominator, the comparison reverses.
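Cost per correct is simple arithmetic: every token spent counts, including tokens on wrong answers, and only correct samples go in the denominator. A minimal sketch with a hypothetical input shape (one `(completion_tokens, is_correct)` tuple per draw):

```python
def cost_per_correct(draws):
    """draws: iterable of (completion_tokens, is_correct) for one arm.
    Returns total tokens divided by the number of correct samples."""
    draws = list(draws)
    total_tokens = sum(tokens for tokens, _ in draws)
    num_correct = sum(1 for _, ok in draws if ok)
    if num_correct == 0:
        return float("inf")  # no correct answer was ever bought
    return total_tokens / num_correct
```

Applied to the two arms of one benchmark, the ratio of the two results gives the Δ column: 8,376 ÷ 1,652 ≈ 5.07 on MATH500 L5 and 15,491 ÷ 6,257 ≈ 2.48 on AIME.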

Caveat: max-token cap

Thinking sampled at max_tokens=16384. Some thinking traces hit the cap mid-trace, which inflates the all-paired ratios. The both-correct subset is the conservative read: there, thinking ran to completion (AIME both-correct median = 5,510 tokens, MATH = 4,958, both well below 16k), and the median ratios of 5.89× and 6.91× come from completed traces rather than truncation. They sit downstream of the same diversity mechanism: three personas commit early on different paths and converge fast; one persona explores with backtracking.

Reproduce: python scripts/analyze_token_efficiency.py. Full per-bucket breakdown including means, geometric means, and minima/maxima at reports/token_efficiency/summary.json.

Bounds on the claim

I keep the claim inside these bounds.

Scope limit

Per-sample accuracy on the panel is lower than on thinking: about 33 pp lower on MATH500 L5, about 51 pp lower on AIME. In the k range I tested, the curves converge but never cross. On AIME, panel solves a strict subset of what thinking solves (11/20 vs 18/20, no panel-only wins). The MATH RL stage also compressed diversity from 0.174 to 0.095; the released checkpoint is an exploration-reliability compromise, not a pure-diversity win. The cost-per-correct comparison flips the per-sample-accuracy picture, but fixed sample budget still favors thinking, and that is the right denominator for some applications.

The reviewer question I most want answered next: is the effect the scaffold, or is it my specific RL recipe? Answering it requires a matched-compute <think>-scaffolded LoRA trained with the same reward, same step count, same base. That experiment is still open.

From wider search to faster hill-climbing

The original draft promised a third benchmark as the next step. The follow-up went further. Instead of a pass@k replication on a new dataset, I tested whether per-sample diversity converts into gradient signal under RLVR (reinforcement learning where the reward is whether the final answer is verifiably correct).

Group-relative RLVR algorithms (GRPO and RLOO, which score each rollout against the others in its batch rather than against a separate critic model) only produce non-zero advantage when a group of rollouts on the same problem contains both correct and incorrect samples. The variance band, problems with 0 < pass@G < 1, is the only regime that contributes to the policy gradient. Wider per-sample search predicts a wider variance band: more problems whose rollout groups land between the all-wrong floor and the all-correct ceiling, more problems with usable advantage, more trainable data.
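Why all-correct and all-wrong groups contribute nothing can be seen from the advantage computation itself. The sketch below uses a leave-one-out baseline in the style of RLOO with binary rewards; it is a simplified illustration, not the training code, and GRPO's additional standard-deviation normalization is omitted.

```python
def leave_one_out_advantages(rewards):
    """Advantage of each rollout = its reward minus the mean reward
    of the other rollouts in the same group (needs group size >= 2)."""
    group_size = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (group_size - 1) for r in rewards]


def has_gradient_signal(rewards):
    """True only when the group mixes correct and incorrect rollouts."""
    return any(a != 0 for a in leave_one_out_advantages(rewards))
```

With binary rewards, a group of `[1, 1, 1, 1]` or `[0, 0, 0, 0]` yields all-zero advantages, while `[1, 0, 0, 0]` gives the correct rollout +1.0 and each incorrect one about −0.33.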

The pool: 877 olympiad problems from HMMT 2024+25, AIME 2024+25, the OlympiadBench math/English/competition split, and AMC. Each problem was sampled G=8 times per arm at temperature 1.0 and sorted into all_zero (none correct), variance_band (1–7 of 8 correct), or all_one (all 8 correct). Variance band carries gradient. The other two cells are training dead weight under group-relative RLVR.
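The three-way sort is a one-line rule per problem. A minimal sketch, assuming each problem has already been reduced to its number of correct rollouts out of G=8; the function names are hypothetical.

```python
from collections import Counter


def classify_band(num_correct, group_size=8):
    """Sort one problem into all_zero, variance_band, or all_one."""
    if not 0 <= num_correct <= group_size:
        raise ValueError("num_correct must be between 0 and group_size")
    if num_correct == 0:
        return "all_zero"
    if num_correct == group_size:
        return "all_one"
    return "variance_band"  # 1..G-1 correct: the only cell with gradient


def band_counts(correct_counts, group_size=8):
    """Tally band membership over a pool of problems."""
    return Counter(classify_band(c, group_size) for c in correct_counts)
```

Running `band_counts` once per arm over the same pool gives the per-arm band sizes compared in the next section.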

Variance-band size on the gradient axis

On the same 877 problems, same G, same temperature:

Panel VB: 382 (43.6% of pool · 0 < pass@8 < 1)
Thinking VB: 209 (23.8% of pool)
VB ratio: 1.83× (panel / thinking)
Token-cost ratio: ~17× (thinking ac_tokens / panel)

Panel sees gradient signal on 382 of 877 problems. Thinking sees it on 209. The 1.83× ratio is the cleanest first-order test of the diversity claim, expressed on the property that controls RL throughput rather than embedding spread. The 17× token-cost ratio is the verbosity story sharpened: panel terminates compactly inside its debate; thinking runs near max_tokens on most problems.

Disjoint frontiers

The two arms' variance bands barely overlap. Cross-tabulating panel × thinking band assignments on the shared pool:

Table 05. Joint band contingency on the 877-problem pool

| panel \ thinking | all_zero | variance_band | all_one | row total |
|---|---|---|---|---|
| all_zero | 232 | 137 | 108 | 477 |
| variance_band | 30 | 72 | 280 | 382 |
| all_one | 0 | 0 | 18 | 18 |
| column total | 262 | 209 | 406 | 877 |

Only 72 problems sit in both bands at once: 19% of panel's, 34% of thinking's. The largest off-diagonal cell on panel's row is the 280 problems where panel still has signal but thinking is saturated. The largest cell on thinking's column is the 137 problems where thinking has signal but panel is below the noise floor. The two scaffolds work in different parts of problem-space, and panel's part is wider. A joint training pool would have discarded the cells the diversity claim predicted.
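The overlap percentages follow directly from the joint contingency table. The sketch below re-derives them from the cell counts reported above; the dictionary layout is just one convenient encoding.

```python
BANDS = ("all_zero", "variance_band", "all_one")

# JOINT[panel_band][thinking_band] = number of problems, from the table above.
JOINT = {
    "all_zero": {"all_zero": 232, "variance_band": 137, "all_one": 108},
    "variance_band": {"all_zero": 30, "variance_band": 72, "all_one": 280},
    "all_one": {"all_zero": 0, "variance_band": 0, "all_one": 18},
}


def row_total(table, panel_band):
    """Problems the panel arm places in panel_band."""
    return sum(table[panel_band].values())


def col_total(table, thinking_band):
    """Problems the thinking arm places in thinking_band."""
    return sum(table[p][thinking_band] for p in BANDS)


pool_size = sum(row_total(JOINT, band) for band in BANDS)
panel_vb = row_total(JOINT, "variance_band")
thinking_vb = col_total(JOINT, "variance_band")
shared_vb = JOINT["variance_band"]["variance_band"]

share_of_panel = shared_vb / panel_vb        # fraction of panel's band
share_of_thinking = shared_vb / thinking_vb  # fraction of thinking's band
vb_ratio = panel_vb / thinking_vb
```

The margins reproduce the band sizes (382 and 209 of 877), and the shared cell is about 19% of panel's band and about 34% of thinking's.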

Each arm therefore trains on its own variance band; both score on the same stratified 100-problem held-out (25 per source, drawn from the shared pool, excluded from both training pools). Hyperparameters held constant across arms: LoRA rank 32, lr 5e-6, group size 16, batch 8, temperature 1.0, max_tokens 12,288, 100 steps, eval_every 10.

Panel hill-climbs on its own variance band

Only the panel arm reached step 100. At ~17× the per-rollout token cost, the matched 100-step thinking run would have taken about 37 hours of Tinker compute against panel's ~2.6, and I hit a billing wall about three hours in. The panel trajectory:

[Figure: Held-out pass@1 (n=100) vs RL step, panel arm, 100 RL steps, LoRA r32 on the panel variance band (354 train problems). Batch-0 baseline .140; final .290; +15 pp absolute (2.07×). Single-sample held-out (~3 pp aggregate stderr); per-source noise ~7–10 pp. Eval at every checkpoint, eval_every=10.]
Figure 2. Panel held-out pass@1 across 100 RL steps on the panel variance band (354 training problems). Trajectory shape: brief weight-shock dip at batch 10, monotone climb to batch 50, step-jump at batch 70, plateau at 27-29%. Vanilla Qwen3-thinking sits at ~60% on the same held-out (off-chart); panel ends at about half of thinking's pre-RL pass rate, but more than doubles its own starting point.

Gains scale with training-pool representation

Panel's training pool is unevenly distributed across sources because each source contributes a different fraction of problems to panel's variance band. The held-out gains track that distribution closely:

Table 06. Per-source held-out gains scale with training-pool representation

| source | % of panel_train | batch 0 | batch 90 | Δ |
|---|---|---|---|---|
| OlympiadBench | 88% (311/354) | 20% | 44% | +24pp |
| AMC | 10% (35/354) | 32% | 52% | +20pp |
| AIME | 2% (6/354) | 4% | 16% | +12pp |
| HMMT | 0.6% (2/354) | 0% | 4% | +4pp |
| aggregate (n=100) | 100% | 14% | 29% | +15pp |

Two readings matter. First: the absolute gain is largest where the training pool is densest (OlympiadBench, +24 pp on 88% of train). Second: every source moved, including HMMT, which had almost no direct training data. The pattern is transfer, not memorization. The 4 pp HMMT lift is one problem out of 25, so the absolute number sits at the noise floor on its own. The monotone ordering across sources carries the stronger evidence.

Diversity through RL

RL often kills diversity. I wanted to know whether 100 steps of RLVR would collapse the policy distribution and erase the property that made the scaffold useful. I tracked two diversity proxies across the run: per-token policy entropy and trajectory-length spread within a group of 16 rollouts on the same problem.

Token entropy (start): 0.591 (batch 0–9 mean)
Token entropy (end): 0.158 (batch 90–99 mean, −73%)
Length CV (start): 0.36 (ac_len within group, batch 0)
Length CV (end): 0.66 (batch 90, +83%)

Entropy follows the textbook prediction. Length spread moves the other way. Panel becomes more confident per token while producing more varied trajectory lengths on the same problem. The signature reads as the three personas keeping their distinct verbosities even as each persona's per-token distribution sharpens.
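Both proxies are cheap to compute. The sketch below shows one plausible definition of each; the entropy unit (nats) and the use of population standard deviation for the CV are assumptions, since the post does not specify either, and the function names are illustrative.

```python
import math
from statistics import mean, pstdev


def token_entropy(probs):
    """Shannon entropy in nats of one next-token distribution.
    A run-level figure would average this over sampled positions."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def length_cv(lengths):
    """Coefficient of variation of rollout lengths within one group:
    population standard deviation divided by the mean."""
    return pstdev(lengths) / mean(lengths)
```

Entropy falls as the per-token distribution sharpens, while length CV rises when rollouts on the same problem differ more in length, so the two can move in opposite directions, as they do in this run.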

Length CV is a cheap proxy, not a semantic one. I did not run the full mpnet-cosine measurement on a post-RL checkpoint, the same pipeline that produced the +78% / +76% numbers above. That measurement is in the open list at the bottom of the post. The directional read here fits the diversity story; the strong claim waits for the strong measurement.

Next work

The Apr 25 update closes the cross-benchmark item from the original list. Variance-band classifications now exist on HMMT, AIME, OlympiadBench, and AMC, with held-out evaluation across all four sources. The remaining items, ordered by how much each would sharpen the claim:

Matched-compute <think> RL arm

Run 100 RL steps on the thinking variance band (180 problems) at the same hyperparameters and put both curves on the same chart. Wall-time estimate: ~37 hours of Tinker compute. The training pool is ready (data/olympiad_pool/thinking_train.jsonl). The run was started and paused at the billing wall. This is the experiment that decides whether the effect is the scaffold or the recipe.

Multi-seed on the panel hill-climbing run

+15 pp on a 100-problem held-out is one seed and a wide CI. A second seed under matched hyperparameters gives the rate-of-climb claim a real number. ~2.6 hours of compute, panel-only.

Held-out diversity post-RL

I measured RL-time entropy (collapsed) and length CV (grew). The semantic measurement on a post-RL checkpoint, the same mpnet-cosine pipeline that produced the +78% and +76% numbers earlier, is the missing piece. If those numbers survive 100 steps of RL, the scaffold's diversity is a stable equilibrium under gradient pressure rather than a starting condition.

Scaling study on n

Pushing panel to n=64 or n=128 on the L5 slice gives the asymptote, which decides between "diversity that catches up" and "diversity that plateaus below thinking." Current data points to the latter. The slope at n=16 remains positive.

· · ·

The headline holds with the addition. A LoRA rank-32 adapter and a few hundred gradient steps, four to five orders of magnitude under the baseline's post-training compute, widen per-sample search by 76-78% on two benchmarks. The widening compounds over k on three independent slices. On a fresh olympiad pool it puts 1.83× more problems in the gradient-signal-bearing band than the production thinking model. 100 RL steps on that band double the held-out pass rate. The mechanism that helped at inference time, more coverage per sample, helps at training time too, because group-relative RLVR only sees variance-band problems, and panel has more of them.

Scaffold choice is a lever, and it costs nothing at inference. The remaining questions narrow to three: matched <think> RL versus panel RL under the same recipe, semantic diversity on a post-RL checkpoint, and the asymptote at larger n. A later experiment can ask what a full post-training pipeline could do with a panel scaffold instead of against one.