May 2026

A tiny RL pass recovers cross-task answer diversity in an instruction-tuned model

A targeted reinforcement-learning pass on Qwen3-30B-A3B-Instruct restored output diversity across ten distinct tasks while leaving knowledge and reasoning benchmarks intact.

GitHub repo · LoRA adapter

TL;DR

Take an open-weight instruction-tuned model (Qwen3-30B-A3B-Instruct). Train it for 50 reinforcement-learning steps on one task: "pick a random integer between 1 and 100." Output diversity improves on that task and on nine other tasks the model was never trained for: random color, fruit, animal, first name, word, emoji, card suit, integer 1–10, integer 1–1000.

MMLU stays flat. GSM8K stays flat. Compute cost: about $25.

Where output diversity quietly disappears

Brainstorming assistants. Creative-generation pipelines. Randomized exploration in agentic systems. Varied sampling in eval harnesses. Each one depends on the model producing different answers when you ask the same question twice. Each one loses that property after the standard "make this model helpful" training pass.

When the Qwen3-30B-A3B instruct model I tested is asked to pick a random integer between 1 and 100, it puts more than 95% of its probability on just three numbers: 4, 42, and 47. The matched base model spreads its mass across the full range.

I tested whether the loss is structural or recoverable. The loss is recoverable. The training pass that recovers it cost me about $25.

Terms used below

Base model. Trained only on next-token prediction. Does not follow instructions.
Instruct model. Base model further trained to follow user instructions (RLHF or DPO).
Post-training. The pipeline that turns a base model into an instruct model.
Output diversity. How spread the model's probability mass is across valid answers. Measured here as total variation (TV) distance to uniform: 0 = perfectly flat, values near 1 = one answer every time. Lower is more diverse.
Temperature. A sampling parameter that rescales the model's output distribution. T=1.0 samples from the raw probabilities. T>1.0 flattens them, raising diversity at some cost to coherence. T<1.0 sharpens them.
Transfer. Improvement on a task the model was never trained for.
MoE / A3B. Mixture-of-experts; A3B = 3 billion active parameters per token (out of 30 billion total in Qwen3-30B-A3B).
GRPO. Group Relative Policy Optimization. The algorithm I used to train. For each prompt, the model generates a small group of responses; each response is scored; the model is nudged toward responses that scored higher than the others in the same group. In this experiment the score rewarded outputs that looked more spread out across the answer space.
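
The group-relative scoring step described in the GRPO entry can be sketched as follows. This is a minimal illustration of the commonly used formulation (center each reward on its group mean, scale by the group standard deviation), not the training code from this experiment. The reward values are made up, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each response relative to the other responses in its group.

    A positive advantage means the response scored above the group mean,
    so the policy update pushes toward it; a negative one pushes away.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from the same prompt.
rewards = [0.2, 0.9, 0.4, 0.5]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.2f}")
```

Because advantages are computed within the group, a group in which every response earns the same reward contributes no learning signal.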

What this enables

This kind of training pass applies where your product depends on the model producing different responses to the same prompt.

Brainstorming and creative-generation pipelines. When five "creative variations" come back looking the same, you are looking at post-training narrowing.
Diverse sampling in agentic systems. Tool-use loops and planning agents suffer when the model makes the same attempt every time it sees a prompt.
Randomization in evaluation harnesses. If your eval shuffles answer options and the model converges on the same letter regardless, your scores are measuring post-training narrowing as much as the underlying capability.
Synthetic data generation for distillation. You cannot distill diverse behavior out of a generator that does not have it.

The result, in one chart

One row per task. Hollow dot is the model before training; filled dot is the same model after 50 RL steps. The training task is highlighted in blue. Every line points the same direction.

The chart shows aggregate uniformity. Here are the actual top picks behind those numbers.

integer 1–100 (trained)

vanilla: 42 · 47 · 1 · 2 · 3

trained: 15 · 36 · 68 · 46 · 63

color

vanilla: blue · green · red · orange · yellow

trained: blue · gray · cyan · purple · magenta

fruit

vanilla: banana · mango · apple · watermelon · peach

trained: melon · grape · mango · kiwi · apple

Three tasks, top-5 most-frequent picks from 50 samples per condition at temperature 1.0. The vanilla model concentrates on the canonical "random" picks for each domain. The trained model spreads to less obvious values; the trained task is highlighted.
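
A top-5 list like the ones above can be tallied from raw samples in a few lines. This is a sketch under assumptions: the sample list is invented, and the normalization (strip whitespace, lowercase) is a guess at a reasonable cleanup step rather than something the experiment is known to have used.

```python
from collections import Counter


def top_picks(samples, k=5):
    """Return the k most frequent answers with their share of all samples."""
    cleaned = [s.strip().lower() for s in samples]
    counts = Counter(cleaned)
    total = len(cleaned)
    return [(answer, count / total) for answer, count in counts.most_common(k)]


# Hypothetical model outputs for "pick a random color".
samples = ["Blue", "blue ", "green", "blue", "red", "green", "blue", "orange"]
for answer, share in top_picks(samples, k=3):
    print(f"{answer}: {share:.0%}")
```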

Why this is not just per-task fine-tuning, or temperature scaling

You can fine-tune a model for each task you care about. I did not. I trained on one task and nine others improved. A per-task fine-tune would mean nine more training runs.

The cheapest task-specific fix needs no training at all: post-hoc logit correction. Subtract a per-choice bias from the model's output scores so the resulting distribution lands flat on whichever task you fit it to. I ran that as a baseline. On any single task it drives TV-to-uniform to zero, a perfectly flat distribution. But the bias vector for "pick an integer 1–100" has no defined meaning on "pick a color." Logit correction cannot transfer across tasks. To produce the result I measured, the mechanism has to act across task families. A parameter update is one such mechanism.
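
To make the logit-correction baseline concrete, here is a minimal sketch on a toy four-choice task. The logits are invented and the helper names are hypothetical. The one assumption about the method is that the bias for each choice is the log-probability the model assigns to it, which is the simplest bias that flattens the fitted task exactly.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fit_bias(probs):
    """Per-choice bias: the model's own log-probability for each choice."""
    return [math.log(p) for p in probs]


def corrected_probs(logits, bias):
    """Subtract the fitted bias from each choice's score, then renormalize."""
    return softmax([score - b for score, b in zip(logits, bias)])


# Hypothetical scores over the four valid answers of one task.
logits = [3.0, 1.0, 0.5, -1.0]
bias = fit_bias(softmax(logits))
flat = corrected_probs(logits, bias)
print([round(p, 3) for p in flat])
```

The bias vector has one entry per choice of the task it was fitted on. Nothing maps those entries onto the choices of a different task, which is why this fix stays task-specific.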

What about temperature scaling? Raising sampling temperature is the cheapest diversity knob available, and it does not get there. Sampling all ten tasks at vanilla T=1.5 narrows the gap to trained-at-T=1.0 but closes only about a quarter of it. Mean TV-to-uniform at vanilla T=1.5 hovers around 0.71; the trained model at T=1.0 sits near 0.43. At T=1.5, ordinary temperature scaling did not replicate what training does on these tasks.
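
For intuition on what the temperature knob does to a peaked distribution, here is a small sketch. The four-way distribution is invented, not measured from the model; the point is only the direction of the effect. Temperature is applied in its standard form: raise each probability to the power 1/T and renormalize, which is equivalent to dividing the logits by T.

```python
def apply_temperature(probs, temperature):
    """Rescale a distribution: p_i ** (1 / T), renormalized."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


def tv_to_uniform(probs):
    """Total variation distance between probs and the uniform distribution."""
    uniform = 1.0 / len(probs)
    return 0.5 * sum(abs(p - uniform) for p in probs)


# Hypothetical peaked distribution over four valid answers.
peaked = [0.90, 0.05, 0.03, 0.02]
for t in (0.7, 1.0, 1.5):
    tv = tv_to_uniform(apply_temperature(peaked, t))
    print(f"T={t}: TV to uniform = {tv:.3f}")
```

Temperature flattens the distribution but preserves the ranking of the choices: the favorite stays the favorite.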

Beyond temperature there are stronger inference-time methods: contrastive decoding, base-model-assisted decoding, rejection sampling. They can recover diversity without changing weights, but they pay at inference time: extra models in the serving path, extra passes, or filtering. This experiment tests a different trade-off: pay once with a small parameter update, then sample normally. I have not run a head-to-head against those methods.

One training pass adjusted something the model uses for all ten of these tasks, not only the trained one. I don't know exactly what. The likely-but-unproven story: post-training nudges some shared internal setting toward narrower outputs across the board, and a short RL pass with a "reward more spread" signal nudges it back. The evidence is consistent with that. I have not isolated which weights changed.

Per-task numbers

Task	Before (TV to uniform)	After (TV to uniform)
integer 1–10	0.890	0.115
fruit	0.461	0.194
first name	0.607	0.300
color	0.889	0.311
card suit	0.712	0.370
integer 1–100 (trained)	0.953	0.413
emoji	0.771	0.420
animal	0.620	0.522
integer 1–1000	0.942	0.571
word	0.770	0.581
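
The table above can be summarized with a few lines of arithmetic. The values below are copied from the table (task labels use ASCII hyphens); the variable names are illustrative.

```python
# (before, after) TV-to-uniform per task, copied from the table above.
results = {
    "integer 1-10": (0.890, 0.115),
    "fruit": (0.461, 0.194),
    "first name": (0.607, 0.300),
    "color": (0.889, 0.311),
    "card suit": (0.712, 0.370),
    "integer 1-100 (trained)": (0.953, 0.413),
    "emoji": (0.771, 0.420),
    "animal": (0.620, 0.522),
    "integer 1-1000": (0.942, 0.571),
    "word": (0.770, 0.581),
}

mean_before = sum(before for before, _ in results.values()) / len(results)
mean_after = sum(after for _, after in results.values()) / len(results)

# Absolute drop in TV per task; larger means more diversity recovered.
drops = {task: before - after for task, (before, after) in results.items()}
largest = max(drops, key=drops.get)
smallest = min(drops, key=drops.get)

print(f"mean before: {mean_before:.3f}, mean after: {mean_after:.3f}")
print(f"largest drop: {largest} ({drops[largest]:.3f})")
print(f"smallest drop: {smallest} ({drops[smallest]:.3f})")
```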

What happens on reasoning tasks

The ten tasks above are all categorical: pick a fruit, pick a color, pick an integer. The diversity lives in the answer space itself. Other product tasks are not shaped like that. Math reasoning has a single correct answer; the diversity lives in the path to it.

I sampled both vanilla and trained on 25 GSM8K problems, k=10 chains of thought per problem, at two temperatures. Among the correct chains in each cell I counted how many distinct calculation paths appeared (different sequences of intermediate numbers):

Condition	Distinct paths (mean, of 10)	Correct (mean, of 10)
vanilla, T=1.0	7.4	9.8
vanilla, T=1.5	8.4	9.7
trained, T=1.0	8.4	9.8
trained, T=1.5	8.7	9.4
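
One way the distinct-path count in the table could be computed is sketched below. The assumptions are mine: a "path" is taken to be the full sequence of numbers appearing in a chain of thought, a chain counts as correct when its last number equals the reference answer, and the example chains are invented. The experiment's actual extraction rule may differ.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def calculation_path(chain):
    """Sequence of numbers in the order they appear in a chain of thought."""
    # Drop commas so "1,000" parses as 1000 (assumes numbers are space-separated).
    return tuple(float(x) for x in NUMBER.findall(chain.replace(",", "")))


def count_distinct_paths(chains, correct_answer):
    """Count distinct number sequences among chains that end on the right answer."""
    paths = set()
    for chain in chains:
        path = calculation_path(chain)
        if path and path[-1] == correct_answer:
            paths.add(path)
    return len(paths)


# Hypothetical chains for a problem whose answer is 14.
chains = [
    "3 + 4 = 7, then 7 * 2 = 14",
    "4 * 2 = 8 and 3 * 2 = 6, so 8 + 6 = 14",
    "3 + 4 = 7, then 7 * 2 = 14",  # repeat of the first path
    "3 + 4 = 8, then 8 * 2 = 16",  # wrong answer, excluded
]
print(count_distinct_paths(chains, correct_answer=14))
```

Comparing exact number sequences is strict: two chains that do the same arithmetic but restate one number differently count as separate paths.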

The picture flips. Vanilla at T=1.5 reaches the same raw diversity as the trained model at T=1.0 (8.4 distinct paths each). Both push toward a soft ceiling near 8.7 distinct paths in a 10-sample draw. Training and temperature partially substitute for each other on this task; combining them adds little (8.7).

Accuracy is where they diverge. Training buys diversity for free: same 9.8 of 10 correct as the baseline at the same temperature. Temperature buys it for a small accuracy tax (9.7 of 10). Stacking them taxes more (9.4 of 10).

The result splits by task family. Where the answer space itself is the diversity target, training does something raising temperature to 1.5 did not. Where the answer is fixed and diversity lives in the path, training and temperature trade off in raw diversity, and training is the better Pareto choice because it preserves accuracy.

Bounds

This is one intervention on one model. I tested one open-weight family (Qwen3) at one size (30B total parameters, 3B active per token), trained on one task from a ten-task panel, and evaluated against a subset of standard capability benchmarks. Replicate with the model you ship before betting product strategy on the finding.

What's measured on the categorical tasks is total variation distance from uniform: distribution flattening, not necessarily useful diversity. For finite dice-like prompts (integers, card suits) uniform over valid answers is the right target. For open semantic domains (fruit, animal, word) "uniform over every valid answer" is not obviously what a product wants. The GSM8K test above gives a partial answer for reasoning tasks; for open-ended generation (essays, brainstorming, product naming) I have not measured.
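
For reference, the metric named above has the following standard form for a task with n valid answers, where p is the model's distribution and u is uniform. How p is estimated (from samples or from token probabilities) is not specified here, so treat the worked bound as an illustration based on the "more than 95% on three numbers" observation for integer 1–100.

```latex
\mathrm{TV}(p, u)
  = \frac{1}{2} \sum_{i=1}^{n} \left| p_i - \frac{1}{n} \right|
  = \max_{A} \left( p(A) - \frac{|A|}{n} \right),
\qquad
0 \le \mathrm{TV}(p, u) \le 1 - \frac{1}{n}.

% Worked bound for integer 1-100 (n = 100), with A = three numbers
% that together hold more than 0.95 of the probability mass:
\mathrm{TV}(p, u) \ge p(A) - \frac{3}{100} > 0.95 - 0.03 = 0.92.
```

The reported before-training value of 0.953 for that task is consistent with this bound.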

Knowledge and reasoning held steady on MMLU (200 questions) and GSM8K (50 problems). Those are smoke tests, not a deployment screen. A diversity-restoration update could affect instruction-following, calibration, safety behavior, tool-use reliability, or JSON formatting without moving either benchmark. The training compute was about $25; the operational cost of shipping a forked model (regression testing, safety review, version management) is separate and not free.

Four things I stand behind: the direction of the effect, the cross-task transfer on these categorical prompts, the fact that knowledge and reasoning held steady across the benchmarks I ran, and that the trained model produces more diverse GSM8K reasoning paths than vanilla at the same temperature without losing accuracy.

Close

The conventional story about RLHF: it makes models better behaved at the cost of creativity, and the trade-off is fundamental. The trade-off is more local than that. Fifty training steps on one task substantially recover the diversity across nine others. Knowledge and reasoning hold.

If your product depends on the model emitting different responses to the same prompt, you have a knob you did not know existed. On tasks where the answer space is the diversity target, the knob did something that raising temperature to 1.5 did not. On tasks where diversity lives in the reasoning path, it is the accuracy-preserving alternative to turning temperature up. The instruct model you ship today does not have to stay narrow.