The diversity effect
A panel-of-experts scaffold changes RL-trained reasoning by widening per-sample search, then turns that diversity into RLVR signal on olympiad math.
I post-trained the same backbone two ways. Trained as a panel of experts arguing inside a debate tag, it produces reasoning traces that sit +78.2% further apart in mpnet-space on MATH500 (paired t = 5.91, p < 10⁻⁸) and +75.6% further apart on AIME (t = 4.72, p < 10⁻⁵) than the same model post-trained to think privately. The pass@k gap to Qwen3-thinking narrows monotonically in k across three independent slices.
Deployment cost. On the 374 paired draws across MATH500 L5 and AIME 24+25 where both arms produced a correct answer, thinking uses 5.9-6.9× more tokens (median) than the panel, and it runs longer than the panel in 99.6-99.7% of all paired draws regardless of correctness. At the cost-per-correct-answer level, total tokens divided by correct samples, thinking is 2.5-5× more expensive than the panel. Per-sample accuracy still favors thinking; per-token efficiency reverses the comparison. (Wilcoxon signed-rank: p = 6×10⁻⁵² on MATH, p = 8×10⁻¹³ on AIME.)
Apr 25 update. Wider per-sample search shows up where it matters for RL. On a fresh 877-problem olympiad pool the panel has 382 variance-band problems against thinking's 209, a 1.83× ratio in panel's favor. The variance band is the only regime where group-relative RLVR generates non-zero gradient. 100 RL steps on that band carry panel from 14% to 29% on a shared held-out set. That's about half of vanilla thinking's pre-RL ~60% on the same problems, but more than double panel's own starting point. Training adapter: LoRA rank 32 on the frozen base, four to five orders of magnitude under the baseline's post-training compute. A matched-compute thinking LoRA would distinguish scaffold from RL recipe. That experiment has not yet been run.
Scaffold work, measured
The default reasoning scaffold is <think>: a long private monologue, often tens of thousands of tokens, before the answer. Every frontier lab ships a thinking model. The format has become invisible the way "let's think step by step" became invisible a few years before it.
But the format is a choice. Train a base model to deliberate as a panel of experts instead, and the first question the field will ask is obvious: does the panel reason differently from <think>, or does it only format its traces differently?
The tests below measure that difference.
Two benchmarks, paired significance tests, and a mechanical check on pass@k (the chance that at least one of k sampled attempts is correct) say it reasons differently. Each panel sample covers more of the solution surface than each thinking sample. Successive panel samples add more marginal coverage than successive thinking samples. Same coin, two sides: search that's wider per sample and compounds over k. The cost is single-shot accuracy. Every property that depends on sampling more than once moves the other way.
Setup: one backbone, two scaffolds
The setup controls the comparison: same backbone, same evals, two reasoning formats.
The backbone
Qwen3-30B-A3B. A 30B-parameter mixture-of-experts model with roughly 3B active parameters per token. Three releases on the same architecture: a pure Base, a production hybrid that toggles <think> via a chat-template flag, and a dedicated Thinking-2507 variant.
The model under test
The panel adapter starts from Base. Pure RL, no SFT warmup, no distillation. The prompt instructs the model to deliberate inside <mutipersonaDebate>…</mutipersonaDebate> and commit a final answer inside <answer>…</answer>. Reward is outcome correctness plus a small tag-validity term. Two stages: 128 GSM8K gradient steps, then a few dozen on MATH. LoRA rank 32 throughout. That is the entire recipe.
# Panel prompt template (used at both train and eval time)
A conversation between User and Multi-Persona Panel of Experts.
The user asks a question, and the Panel solves it. The Panel first
deliberates and debates the reasoning process with each other and
then provides the user with the answer.
User: {problem}
Assistant: <mutipersonaDebate> … </mutipersonaDebate> <answer> … </answer>
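The reward described above (outcome correctness plus a small tag-validity term) could be sketched roughly as follows. This is a minimal illustration, not the training code: the `TAG_BONUS` weight, the `reward` function name, and the exact-string answer comparison are all assumptions, since the post only says the tag term is "small" and does not describe the grader.

```python
import re

# Hypothetical weight: the post only says the tag-validity term is "small".
TAG_BONUS = 0.1

# One debate block followed by one answer block, with only whitespace around them.
_FORMAT = re.compile(
    r"^\s*<mutipersonaDebate>(?P<debate>.+?)</mutipersonaDebate>\s*"
    r"<answer>(?P<answer>.+?)</answer>\s*$",
    re.DOTALL,
)


def reward(completion: str, gold: str) -> float:
    """Outcome correctness plus a small bonus for well-formed tags."""
    match = _FORMAT.match(completion)
    if match is None:
        return 0.0  # malformed output: no tag bonus and no answer to grade
    # Exact string match stands in for a real math grader (an assumption).
    correct = match.group("answer").strip() == gold.strip()
    return (1.0 if correct else 0.0) + TAG_BONUS
```

A well-formed but wrong completion earns only the tag bonus, so the format term can never outweigh getting the answer right.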
The baseline
Qwen's production thinking model. Same base architecture, same parameter count, full proprietary post-training stack. I access it through apply_chat_template(enable_thinking=True), which routes the request to the post-trained policy that emits <think>…</think> before the answer. It is the closest peer the architecture allows.
Same backbone, same parameter count, same temperature (1.0), same benchmarks, same n, same grader. Two axes vary: the reasoning format each model was post-trained to emit, and the amount of post-training behind it.
Measuring "different reasoning"
I defined diversity as per-problem semantic spread. Draw N completions from each model on the same problem at temperature 1.0, group them per problem and per model, and measure where they sit in semantic space relative to each other. A scaffold that explores more of the solution surface per sample should produce traces that sit further apart.
Extraction: the reasoning span is pulled from <mutipersonaDebate> (panel) or <think> (thinking). Tag-extraction hit rate: 200/200 on both variants, with no fallback artifacts.
Embedding: sentence-transformers/all-mpnet-base-v2 on the opening of each trace. The opening is where strategy is chosen; the rest is elaboration. The result is stable at an 8,000-char window too.
The pipeline ran on two benchmarks, paired problem-for-problem. Every comparison below is same-problem, same-n.
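As a rough sketch of the per-problem spread measurement, the snippet below computes the mean cosine distance over all pairs of trace embeddings for one problem. It takes plain lists of floats, so the embedding step itself is left out; the function names and the toy 2-d vectors are illustrative assumptions, not part of the original pipeline.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def semantic_spread(embeddings):
    """Mean cosine distance over all unordered pairs of trace embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two traces per problem")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy 2-d vectors standing in for mpnet embeddings of one problem's traces.
tight = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]
wide = [[1.0, 0.0], [1.0, 1.0], [1.0, -1.0]]
print(semantic_spread(tight) < semantic_spread(wide))  # True
```

Averaging over pairs rather than measuring distance to a centroid keeps the number comparable across problems with the same N.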
Wider search per sample
Mean pairwise cosine distance across the N traces a model produces on one problem proxies per-sample search volume: how much of the solution surface a single independent sample touches before committing. Higher distance means wider search. Panel's per-sample volume is larger than thinking's on both benchmarks, with matching relative effect sizes.
| Benchmark | Panel | Thinking | Δ abs | Δ rel | t | n |
|---|---|---|---|---|---|---|
| MATH500 L1–L5 | 0.095 | 0.053 | +0.042 | +78.2% | +5.91 | 50 |
| AIME 2024+2025 | 0.119 | 0.068 | +0.051 | +75.6% | +4.72 | 20 |
The two benchmarks sample different difficulty regimes. MATH500 is near-saturated for thinking (pass@3 = 1.000 on the stratified slice). AIME is harder; thinking solves 73% per sample. If both models were converging on the same easy answers, the diversity lift would shrink on AIME. It stays at the same scale.
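For readers who want to see what a paired t-statistic of this kind involves, here is a minimal sketch. It assumes the test pairs one spread value per arm per problem, so n is the number of problems; the post does not spell out the pairing unit, and the numbers below are made up.

```python
import math
from statistics import mean, stdev


def paired_t(panel, thinking):
    """Paired t-statistic on per-problem differences (panel - thinking)."""
    if len(panel) != len(thinking):
        raise ValueError("arms must cover the same problems")
    diffs = [p - t for p, t in zip(panel, thinking)]
    n = len(diffs)
    # Standard error of the mean difference; stdev uses the n-1 denominator.
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1  # (t, degrees of freedom)


# Made-up spreads for four problems, not values from the evals.
panel_spread = [0.10, 0.12, 0.09, 0.11]
thinking_spread = [0.05, 0.06, 0.06, 0.05]
t_stat, dof = paired_t(panel_spread, thinking_spread)
```

Pairing on the problem removes between-problem difficulty variance from the comparison, which is why the same-problem, same-n design matters.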
Token length fails to explain the lift
Panel's traces are roughly 1,400 characters on AIME against thinking's 11,300. Thinking writes about 8× as much text per sample, so panel achieves its diversity lift in one-eighth the text. Per character, the spread is about 6× wider. The scaffold is more divergent at every length.
Cluster count holds at ~1.0 for both models on both benchmarks. The lift is continuous-variance, not discrete-mode. Both scaffolds commit to one strategy per problem; panel orbits that strategy across a wider arc.
The pass@k signature
An mpnet result on its own is fair to doubt. I looked for a mechanical consequence with no embeddings inside it. If panel covers more solution surface per sample, each additional panel sample should add more marginal coverage than each additional thinking sample. The two pass@k curves should converge as k grows.
The signature replicates on three independent slices, including AIME.
| Slice | k range | Gap at k=1 | Gap at k_max | Closure |
|---|---|---|---|---|
| MATH500 L1–L5 stratified | k=1 → 4 | −16.5pp | −10.0pp | +6.5pp |
| MATH500 L5 full sweep | k=1 → 4 | −32.5pp | −17.8pp | +14.7pp |
| AIME 2024+2025 | k=1 → 16 | −50.6pp | −35.0pp | +15.6pp |
Three slices, three positive closures, all monotone in k. The pass@k mechanism agrees with the embedding measurement, and it does so without knowing anything about embeddings.
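The pass@k and closure figures can be reproduced from per-problem correct counts along these lines. The sketch uses the standard combinatorial estimator, 1 − C(n−c, k)/C(n, k); that choice is an assumption, since the post does not say which estimator the evals used, and `gap_closure` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples is correct),
    given c correct completions among n sampled ones."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def gap_closure(panel_curve, thinking_curve):
    """Change in the (panel - thinking) gap from the first k to the last."""
    first = panel_curve[0] - thinking_curve[0]
    last = panel_curve[-1] - thinking_curve[-1]
    return last - first
```

A positive closure means the panel curve rose faster than the thinking curve over the k range, which is the convergence signature the table reports.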
Token efficiency for deployment
The pass@k story says: at fixed sample count, panel closes the gap with k. Deployment adds a cost question: at fixed correctness, which scaffold uses fewer tokens? Cosine distances and pass@k curves are mechanism-level results. Tokens-per-correct-answer shows up on the bill.
I re-joined the existing eval rollouts pairwise on (problem, sample-index), using the same prompts, same temperature, and same max-token budgets, then read the completion_tokens field that sampling wrote out. No re-tokenization, no embeddings, no judgment calls. The comparison is paired by construction.
| Bucket | n | Panel tokens (median) | Thinking tokens (median) | Median ratio | Thinking-longer in |
|---|---|---|---|---|---|
| MATH500 L5, both correct | 306 | 652 | 4,958 | +6.91× | 306 / 306 |
| MATH500 L5, panel-only correct | 12 | 825 | 10,013 | +14.17× | 12 / 12 |
| MATH500 L5, all paired | 536 | 786 | 6,634 | +7.65× | 534 / 536 |
| AIME 24+25, both correct | 68 | 1,034 | 5,510 | +5.89× | 67 / 68 |
| AIME 24+25, panel-only correct | 4 | 1,116 | 16,384 | +14.95× | 4 / 4 |
| AIME 24+25, all paired | 320 | 1,069 | 11,977 | +9.85× | 319 / 320 |
Wilcoxon signed-rank on the both-correct subsets, pairing every (problem, sample-index) draw on the (think − panel) token-count diff: p = 6.4×10⁻⁵² on MATH500 (n=306), p = 8.0×10⁻¹³ on AIME (n=68). The directionality is so strong that the more useful summary is the count, not the p-value: thinking is longer in 99.6% of MATH paired draws and 99.7% of AIME paired draws, regardless of who's right.
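The pairing step can be sketched as a join on (problem, sample-index) followed by per-draw ratios. One reading note: the table's median ratio differs from the ratio of the two medians (4,958 / 652 is about 7.6, not 6.91), which is consistent with a median taken over per-draw ratios; the sketch assumes that reading. The dict layout and function name are illustrative, and the token counts are made up.

```python
from statistics import median


def paired_token_stats(panel, thinking):
    """Join two rollout tables on (problem, sample_index) and compare lengths.

    Each argument maps (problem_id, sample_index) -> completion_tokens.
    """
    keys = sorted(panel.keys() & thinking.keys())  # draws present in both arms
    ratios = [thinking[k] / panel[k] for k in keys]
    longer = sum(thinking[k] > panel[k] for k in keys)
    return {
        "n": len(keys),
        "median_ratio": median(ratios),
        "thinking_longer": longer,
    }


# Made-up token counts for illustration only.
panel = {("p1", 0): 600, ("p1", 1): 700, ("p2", 0): 900}
thinking = {("p1", 0): 4800, ("p1", 1): 3500, ("p2", 0): 800, ("p3", 0): 5000}
stats = paired_token_stats(panel, thinking)
```

Draws that exist in only one arm are dropped by the key intersection, so every reported number is paired by construction.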
Reframe by deployment cost. Total tokens spent in a benchmark run, divided by the number of correct answers it produced, gives a cost-per-correct figure that absorbs both the per-sample length and the hit rate into a single number:
| Benchmark | Panel | Thinking | Δ |
|---|---|---|---|
| MATH500 L5 (n = 536 paired draws) | 1,652 tok / correct | 8,376 tok / correct | thinking 5.07× more expensive |
| AIME 2024+2025 (n = 320 paired draws) | 6,257 tok / correct | 15,491 tok / correct | thinking 2.48× more expensive |
The two findings sit in tension only if the metric is per-sample accuracy. Thinking has higher per-sample accuracy on both benchmarks (90.9% vs 59.3% on MATH500 L5; 73.1% vs 22.5% on AIME). It also pays 2.5–5× more compute for each correct answer it produces. Per-sample-accuracy and cost-per-correct disagree because thinking's correct samples are too long to amortize the gap. Once compute is in the denominator, the comparison reverses.
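The cost-per-correct figure is simple enough to state as code. The sketch below follows the definition given above (total tokens divided by correct answers); the function name and the toy draws are assumptions for illustration.

```python
def cost_per_correct(draws):
    """Total tokens across all draws divided by the number of correct draws.

    `draws` is a list of (completion_tokens, is_correct) pairs.
    """
    total_tokens = sum(tokens for tokens, _ in draws)
    n_correct = sum(1 for _, ok in draws if ok)
    if n_correct == 0:
        return float("inf")  # no correct answers: cost is unbounded
    return total_tokens / n_correct


# Made-up draws: a short, less accurate arm and a long, more accurate arm.
short_arm = [(700, True), (800, False), (750, True), (900, False)]
long_arm = [(6000, True), (7000, True), (6500, True), (9000, False)]
```

The same quantity equals mean tokens per draw divided by per-sample accuracy, which is why a large enough length gap can outweigh a large accuracy gap.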
Thinking was sampled at max_tokens=16384. Some thinking traces hit the cap mid-trace, which inflates the all-paired ratios. The both-correct subset is the conservative read: there, thinking ran to completion (AIME both-correct median = 5,510 tokens, MATH = 4,958, both well below 16k), and the median ratios of 5.89× and 6.91× come from completed traces rather than truncation. Those ratios sit downstream of the same diversity mechanism: three personas commit early on different paths and converge fast; one persona explores with backtracking.
Reproduce: python scripts/analyze_token_efficiency.py. Full per-bucket breakdown including means, geometric means, and minima/maxima at reports/token_efficiency/summary.json.
Bounds on the claim
I keep the claim inside these bounds.
Per-sample accuracy on the panel is lower than on thinking: about 33 pp lower on MATH500 L5, about 51 pp lower on AIME. In the k range I tested, the curves converge but never cross. On AIME, panel solves a strict subset of what thinking solves (11/20 vs 18/20, no panel-only wins). The MATH RL stage also compressed diversity from 0.174 to 0.095; the released checkpoint is an exploration-reliability compromise, not a pure-diversity win. The cost-per-correct comparison flips the per-sample-accuracy picture, but fixed sample budget still favors thinking, and that is the right denominator for some applications.
The reviewer question I most want answered next: is the effect the scaffold, or is it my specific RL recipe? Answering it requires a matched-compute <think>-scaffolded LoRA trained with the same reward, same step count, same base. That experiment is still open.
From wider search to faster hill-climbing
The original draft promised a third benchmark as the next step. The follow-up went further. Instead of a pass@k replication on a new dataset, I tested whether per-sample diversity converts into gradient signal under RLVR (reinforcement learning where the reward is whether the final answer is verifiably correct).
Group-relative RLVR algorithms (GRPO and RLOO, which score each rollout against the others in its group rather than against a separate critic model) only produce non-zero advantage when a group of rollouts on the same problem contains both correct and incorrect samples. The variance band, problems with 0 < pass@G < 1, is the only regime that contributes to the policy gradient. Wider per-sample search predicts a wider variance band: more problems that fall between the all-wrong floor and the all-right ceiling, more problems with usable advantage, more trainable data.
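A small sketch makes the dead-weight argument concrete. It uses a GRPO-style normalization (reward minus group mean, divided by group standard deviation); the `eps` term and the function name are assumptions, and real implementations differ in details.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Center each rollout's reward on the group mean and scale by group std.

    GRPO-style normalization; RLOO uses a leave-one-out mean instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


print(group_advantages([1, 1, 1, 1]))  # all correct: every advantage is 0.0
print(group_advantages([0, 0, 0, 0]))  # all wrong: every advantage is 0.0
print(group_advantages([1, 0, 0, 0]))  # mixed group: non-zero advantages
```

Only the mixed group yields a non-zero advantage, so only mixed groups move the policy.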
The pool: 877 olympiad problems from HMMT 2024+25, AIME 2024+25, the OlympiadBench math/English/competition split, and AMC. Each problem was sampled G=8 times per arm at temperature 1.0 and sorted into all_zero (none correct), variance_band (1–7 of 8 correct), or all_one (all 8 correct). Variance band carries gradient. The other two cells are training dead weight under group-relative RLVR.
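The three-way sort can be written directly from the definition above. This is a minimal sketch with a hypothetical `classify` helper and made-up correct counts.

```python
from collections import Counter


def classify(n_correct: int, group_size: int = 8) -> str:
    """Assign a problem to a band from its number of correct rollouts."""
    if not 0 <= n_correct <= group_size:
        raise ValueError("n_correct must lie in [0, group_size]")
    if n_correct == 0:
        return "all_zero"
    if n_correct == group_size:
        return "all_one"
    return "variance_band"


# Made-up correct counts out of G=8 for five problems.
counts = [0, 3, 8, 7, 0]
bands = Counter(classify(c) for c in counts)
```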
Variance-band size on the gradient axis
On the same 877 problems, same G, same temperature, panel sees gradient signal on 382 problems and thinking sees it on 209. The 1.83× ratio is the cleanest first-order test of the diversity claim, expressed on the property that controls RL throughput rather than embedding spread. The ~17× per-rollout token-cost ratio between the arms is the verbosity story sharpened: panel terminates compactly inside its debate; thinking runs near max_tokens on most problems.
Disjoint frontiers
The two arms' variance bands barely overlap. Cross-tabulating panel × thinking band assignments on the shared pool:
| panel \ thinking | all_zero | variance_band | all_one | row total |
|---|---|---|---|---|
| all_zero | 232 | 137 | 108 | 477 |
| variance_band | 30 | 72 | 280 | 382 |
| all_one | 0 | 0 | 18 | 18 |
| column total | 262 | 209 | 406 | 877 |
Only 72 problems sit in both bands at once: 19% of panel's, 34% of thinking's. The largest off-diagonal cell on panel's row is the 280 problems where panel still has signal but thinking is saturated. The largest cell on thinking's column is the 137 problems where thinking has signal but panel is below the noise floor. The two scaffolds work in different parts of problem-space, and panel's part is wider. A joint training pool would have discarded the cells the diversity claim predicted.
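The overlap percentages and the 1.83× ratio follow from the cross-tab above by simple arithmetic. The snippet below copies the nine cell counts from that table and recomputes the marginals; the variable names are illustrative.

```python
# Counts copied from the cross-tab above: rows are panel bands, columns thinking bands.
BANDS = ["all_zero", "variance_band", "all_one"]
table = {
    "all_zero":      {"all_zero": 232, "variance_band": 137, "all_one": 108},
    "variance_band": {"all_zero": 30,  "variance_band": 72,  "all_one": 280},
    "all_one":       {"all_zero": 0,   "variance_band": 0,   "all_one": 18},
}

panel_totals = {row: sum(table[row].values()) for row in BANDS}
thinking_totals = {col: sum(table[row][col] for row in BANDS) for col in BANDS}

shared = table["variance_band"]["variance_band"]
panel_band = panel_totals["variance_band"]        # 382
thinking_band = thinking_totals["variance_band"]  # 209

print(round(100 * shared / panel_band))      # 19
print(round(100 * shared / thinking_band))   # 34
print(round(panel_band / thinking_band, 2))  # 1.83
```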
Each arm therefore trains on its own variance band; both score on the same stratified 100-problem held-out (25 per source, drawn from the shared pool, excluded from both training pools). Hyperparameters held constant across arms: LoRA rank 32, lr 5e-6, group size 16, batch 8, temperature 1.0, max_tokens 12,288, 100 steps, eval_every 10.
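One way to keep those hyperparameters identical across arms is a single frozen config object. The values below are the ones listed above; the field names, the class name, and the reading of "batch 8" as eight problems per step are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RLConfig:
    # Values are the ones listed in the post; field names are illustrative.
    lora_rank: int = 32
    learning_rate: float = 5e-6
    group_size: int = 16   # rollouts per problem
    batch_size: int = 8    # assumed to mean problems per step
    temperature: float = 1.0
    max_tokens: int = 12_288
    steps: int = 100
    eval_every: int = 10

    @property
    def rollouts_per_step(self) -> int:
        """Rollouts sampled per step under the batch-size assumption above."""
        return self.group_size * self.batch_size


SHARED = RLConfig()
```

Freezing the dataclass means neither arm can mutate a shared setting, so any difference between runs has to come from the training pool or the scaffold.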
Panel hill-climbs on its own variance band
Only the panel arm reached step 100. At ~17× the per-rollout token cost, the matched 100-step thinking run would have taken about 37 hours of Tinker compute against panel's ~2.6, and I hit a billing wall about three hours in. The panel trajectory on the shared held-out runs from 14% at batch 0 to 29% at batch 90.
Gains scale with training-pool representation
Panel's training pool is unevenly distributed across sources because each source contributes a different fraction of problems to panel's variance band. The held-out gains track that distribution closely:
| source | % of panel_train | batch 0 | batch 90 | Δ |
|---|---|---|---|---|
| OlympiadBench | 88% (311/354) | 20% | 44% | +24pp |
| AMC | 10% (35/354) | 32% | 52% | +20pp |
| AIME | 2% (6/354) | 4% | 16% | +12pp |
| HMMT | 0.6% (2/354) | 0% | 4% | +4pp |
| aggregate (n=100) | 100% | 14% | 29% | +15pp |
Two readings matter. First: the absolute gain is largest where the training pool is densest (OlympiadBench, +24 pp on 88% of train). Second: every source moved, including HMMT, which had almost no direct training data. The pattern is transfer, not memorization. The 4 pp HMMT lift is one problem out of 25, so the absolute number sits at the noise floor on its own. The monotone ordering across sources carries the stronger evidence.
Diversity through RL
RL often kills diversity. I wanted to know whether 100 steps of RLVR would collapse the policy distribution and erase the property that made the scaffold useful. I tracked two diversity proxies across the run: per-token policy entropy and trajectory-length spread within a group of 16 rollouts on the same problem.
Entropy follows the textbook prediction. Length spread moves the other way. Panel becomes more confident per token while producing more varied trajectory lengths on the same problem. The signature reads as the three personas keeping their distinct verbosities even as each persona's per-token distribution sharpens.
Length CV is a cheap proxy, not a semantic one. I did not run the full mpnet-cosine measurement on a post-RL checkpoint, the same pipeline that produced the +78% / +76% numbers above. That measurement is in the open list at the bottom of the post. The directional read here fits the diversity story; the strong claim waits for the strong measurement.
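The two proxies are cheap to compute. The sketch below shows one plausible form of each: Shannon entropy of a single next-token distribution, and the coefficient of variation of rollout lengths within a group. The exact definitions used during the run are not stated, so treat both functions and the toy numbers as assumptions.

```python
import math
from statistics import mean, pstdev


def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def length_cv(lengths):
    """Coefficient of variation of rollout lengths within one group."""
    return pstdev(lengths) / mean(lengths)


# Made-up numbers: a sharper token distribution alongside more varied lengths.
early_probs, late_probs = [0.4, 0.3, 0.3], [0.9, 0.05, 0.05]
early_lengths, late_lengths = [900, 1000, 1100], [500, 1000, 1500]
```

The toy values reproduce the qualitative pattern described above: entropy falls while length CV rises, so the two proxies can disagree.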
RLVR scope
- Claim: the variance-band advantage the diversity claim predicts is real (382 vs 209 on the 877-problem pool). It converts to RL gradient signal: 100 steps double panel's olympiad held-out pass rate (14% → 29%), with per-source gains scaling with training-pool representation. The diversity reservoir survives the run on the proxy I measured: per-token entropy collapses, but length spread within a group grows.
- Limit: I did not finish the matched <think> RL comparison. At ~17× the per-rollout token cost, the matched 100-step thinking run would have taken about 37 hours of Tinker compute, and I hit the billing wall first. The static reference is vanilla Qwen3-thinking on the shared held-out: ~60% pass@1 aggregate, ~55% on OlympiadBench, ~75% on AMC, all at G=8. Panel ends at 29%, about half. The result here is about rate of climb under matched hyperparameters, not absolute capability.
- Single seed, single run: the +15 pp number is one seed. The curve's shape replicates across the per-source breakdown; the absolute number wants a second seed before its CI gets reported.
Next work
The Apr 25 update closes the cross-benchmark item from the original list. Variance-band classifications now exist on HMMT, AIME, OlympiadBench, and AMC, with held-out evaluation across all four sources. The remaining items, ordered by how much each would sharpen the claim:
Matched-compute <think> RL arm
Run a 100-step RL run on the thinking variance band (180 problems) at the same hyperparameters and put both curves on the same chart. Wall-time estimate: ~37 hours of Tinker compute. The training pool is ready (data/olympiad_pool/thinking_train.jsonl). The run was started and paused at the billing wall. This is the experiment that decides whether the effect is the scaffold or the recipe.
Multi-seed on the panel hill-climbing run
+15 pp on a 100-problem held-out is one seed and a wide CI. A second seed under matched hyperparameters gives the rate-of-climb claim a real number. ~2.6 hours of compute, panel-only.
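To get a rough sense of how wide the interval is on a 100-problem held-out, one option is a Wilson score interval on each pass rate. This is only a sketch: it treats the 100 problems as independent draws and captures evaluation noise alone, not the seed-to-seed variance that a second run would measure, and it is not a method the post itself uses.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low_start, high_start = wilson_interval(14, 100)  # pre-RL held-out pass rate
low_end, high_end = wilson_interval(29, 100)      # pass rate at batch 90
```

Under these assumptions each interval spans well over ten percentage points, which is one way to read "a wide CI" on a held-out of this size.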
Held-out diversity post-RL
I measured RL-time entropy (collapsed) and length CV (grew). The semantic measurement on a post-RL checkpoint, the same mpnet-cosine pipeline that produced the +78% and +76% numbers earlier, is the missing piece. If those numbers survive 100 steps of RL, the scaffold's diversity is a stable equilibrium under gradient pressure rather than a starting condition.
Scaling study on n
Pushing panel to n=64 or n=128 on the L5 slice gives the asymptote, which decides between "diversity that catches up" and "diversity that plateaus below thinking." Current data points to the latter. The slope at n=16 remains positive.
The headline holds with the addition. A LoRA rank-32 adapter and a few hundred gradient steps, four to five orders of magnitude under the baseline's post-training compute, widen per-sample search by 76-78% on two benchmarks. The widening compounds over k on three independent slices. On a fresh olympiad pool it puts 1.83× more problems in the gradient-signal-bearing band than the production thinking model. 100 RL steps on that band double the held-out pass rate. The mechanism that helped at inference time, more coverage per sample, helps at training time too, because group-relative RLVR only sees variance-band problems, and panel has more of them.
Scaffold choice is a lever, and it costs nothing at inference. The remaining questions narrow to three: matched <think> RL versus panel RL under the same recipe, semantic diversity on a post-RL checkpoint, and the asymptote at larger n. A later experiment can ask what a full post-training pipeline could do with a panel scaffold instead of against one.