May 2026

Verifier-guided prompt mining turned a zero-success 1.5B model into a transferable bug fixer with ten LoRA examples

A small open-weight instruction model (Qwen2.5-1.5B-Instruct, not a code-specialized model) produced zero correct fixes across 384 attempts at a synthetic Python bug-fix task. A prompt-optimization pass mined ten verified-correct fixes from the same model. A LoRA trained on those ten then fixed bugs in categories that appeared in none of its training examples.

TL;DR

Take an open-weight 1.5-billion-parameter language model (Qwen2.5-1.5B-Instruct). Ask it to fix bugs in 32 synthetic Python programs with hidden pytest cases as the verifier. The deployment prompt produces 0 verified-correct patches across 384 attempts. Switch to optimized prompts on the same model, and you get verified-correct fixes for 8 to 14 different bug tasks per run across 3 to 5 bug categories.

Train a LoRA adapter on ten of those patches for 24 steps. The adapter solves 3 to 4 of 6 unseen bug categories. The same setup with patches mined using only a pass/fail count solves 0 to 1. LoRA training cost: about thirty cents.

Where the training data already lives

The default story for adapting AI looks like this. Take an existing model, gather labeled examples of a new task, fine-tune the model on them, get a better model. The expensive part is the examples. Annotation companies exist to provide them, and synthetic-data pipelines exist to generate them. "We need more data" is the standard diagnosis when fine-tuning fails.

I tested whether the apparent shortage was structural or a prompting failure. It was a prompting failure. The training pass that recovered the missing skill cost about thirty cents on a Modal L4 GPU.

"In the model" is a slight overstatement. The data is reachable from the model in interaction with a verifier and its failure messages, not by the deployment prompt on its own. The model plus the verifier-feedback loop is where the data lives. That distinction matters because the verifier shapes what counts as correct, and (as the chart below shows) the quality of the feedback the optimizer sees changes what the mined examples teach.

What this enables

The same approach applies anywhere you want to fine-tune a small open-weight model on a verifier-friendly task and the deployment prompt is scoring zero.

  1. Code-repair tools with automated tests. The deployment prompt may not produce correct fixes on its own. The model may still know how, under different wording.
  2. Structured-extraction pipelines with schema validators. If a schema check is your verifier, prompt optimization can find wordings that produce schema-valid outputs the model couldn't otherwise emit.
  3. Any sparse-reward task where you were about to commission expensive annotation. Try the cheap experiment first. If prompt optimization mines verified-correct examples from the model, you may not need the annotation pass at all.
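
For the structured-extraction case, the verifier can be as small as a schema check. A minimal sketch, assuming a made-up two-field schema; `REQUIRED_FIELDS` and the field names are hypothetical and not from the experiment described here:

```python
import json
from typing import Dict

# Hypothetical schema for illustration only: field name -> required type.
REQUIRED_FIELDS: Dict[str, type] = {"invoice_id": str, "total": float}


def is_schema_valid(raw_output: str) -> bool:
    """Return True only if the model output parses as JSON and matches the schema."""
    try:
        record = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    return all(isinstance(record.get(key), kind) for key, kind in REQUIRED_FIELDS.items())


print(is_schema_valid('{"invoice_id": "A-17", "total": 41.5}'))
print(is_schema_valid('{"invoice_id": "A-17", "total": "41.5"}'))
print(is_schema_valid("Sure! Here is the JSON you asked for."))
```

Any output that passes a check like this is a candidate training example. Outputs that fail are simply discarded.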

The approach does not apply when the deployment prompt is already producing correct outputs at a workable rate, or when no automated verifier exists for the task.

The result, in one chart

The chart compares two ways of running the mining pass. In the scalar condition, the prompt-optimization search saw only the count of tests each prompt candidate passed. In the localized condition, the search saw per-test-case failures with assertion errors. Both pools yielded ten verified-correct fixes; both LoRAs used the same training recipe. The transferability gap on held-out categories (3.67/6 vs 0.33/6) is the post's strongest practical lesson: diagnostic feedback quality, not just example count, determines whether mined examples teach generalizable behavior.
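
The difference between the two conditions is easiest to see as two views of the same test results. The sketch below is illustrative, not the experiment's actual code: `CaseResult`, the test names, and the message format are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CaseResult:
    name: str                     # hidden test case identifier (hypothetical)
    passed: bool
    error: Optional[str] = None   # assertion message when the case fails


def scalar_feedback(results: List[CaseResult]) -> str:
    """Scalar condition: the search sees only how many tests passed."""
    passed = sum(r.passed for r in results)
    return f"{passed}/{len(results)} tests passed"


def localized_feedback(results: List[CaseResult]) -> str:
    """Localized condition: the search also sees each failing case and its error."""
    lines = [scalar_feedback(results)]
    for r in results:
        if not r.passed:
            lines.append(f"FAILED {r.name}: {r.error}")
    return "\n".join(lines)


results = [
    CaseResult("test_empty_list", True),
    CaseResult("test_off_by_one", False, "AssertionError: expected 5, got 4"),
]
print(scalar_feedback(results))
print(localized_feedback(results))
```

Both functions consume identical verifier output. Only the amount of diagnostic detail passed back to the prompt search differs.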

[Figure: Repair success on six bug categories never seen during training. Each LoRA trained on ten verified-correct fixes for 24 steps. Bars show three random training samples per condition plus the mean, on a 0 / 6 to 6 / 6 axis. Scalar feedback: 0, 0, 1 (mean 0.33). Localized feedback: 4, 3, 4 (mean 3.67).]
Three random samples of ten training fixes per condition, taken from the same verified-positive pool. The mean of localized-feedback samples is more than ten times the mean of scalar-feedback samples on bug categories absent from any training example.
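
The means and the "more than ten times" ratio can be checked in a few lines. The per-sample counts below are read off the chart (scalar: 0, 0, 1; localized: 4, 3, 4).

```python
from statistics import mean

scalar = [0, 0, 1]      # held-out successes per sample, scalar feedback
localized = [4, 3, 4]   # held-out successes per sample, localized feedback

scalar_mean = mean(scalar)
localized_mean = mean(localized)

print(round(scalar_mean, 2), round(localized_mean, 2))  # 0.33 3.67
print(round(localized_mean / scalar_mean, 1))           # 11.0
```

With only three samples per condition and six held-out items, the ratio is a rough indicator of direction, not a precise effect size.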

Why this is not training on the test set

You might ask: if the AI produces the fixes and you train it on its own fixes, isn't that circular? It would be, but the verifier breaks the loop. The hidden pytest cases the model never sees decide which of its outputs count as correct. The training material is the subset of model outputs an automated grader accepts. That is different from training a model on its own opinion of itself.
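
The filtering step can be sketched as follows. This is a simplified stand-in for the real setup: it uses a plain assertion function where the experiment used hidden pytest cases, and the task (`last`) and helper names are hypothetical.

```python
from typing import Callable, Dict, List


def passes_hidden_tests(patch_source: str, hidden_test: Callable[[Dict], None]) -> bool:
    """Load a candidate patch, then run the hidden test against what it defines."""
    namespace: Dict = {}
    try:
        # NOTE: exec on model output is unsafe outside a sandbox; illustration only.
        exec(patch_source, namespace)
        hidden_test(namespace)  # raises AssertionError on a wrong fix
    except Exception:
        return False
    return True


def mine_verified(candidates: List[str], hidden_test: Callable[[Dict], None]) -> List[str]:
    """Keep only the candidates the grader accepts."""
    return [c for c in candidates if passes_hidden_tests(c, hidden_test)]


# Hypothetical task: `last` should return the final element of a list.
def hidden_test(ns: Dict) -> None:
    assert ns["last"]([1, 2, 3]) == 3
    assert ns["last"]([7]) == 7


candidates = [
    "def last(xs):\n    return xs[len(xs)]\n",  # still buggy: raises IndexError
    "def last(xs):\n    return xs[-1]\n",       # correct fix
    "def last(xs):\n    return xs[0]\n",        # runs, but returns the wrong element
]
verified = mine_verified(candidates, hidden_test)
print(len(verified))  # 1
```

The model never sees `hidden_test`, so it cannot talk its way into the training set. Only outputs that actually behave correctly survive.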

The held-out test adds a second layer. Six of the 38 validation tasks belong to bug categories that do not appear in any training example for any condition. The trained model handles bug types it saw zero examples of during training.
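
A held-out-category split of this kind takes only a few lines to express. The sketch assumes each task carries a `category` label; the task IDs and category names are invented for illustration.

```python
from typing import Dict, List, Set


def held_out_tasks(validation: List[Dict[str, str]], train_categories: Set[str]) -> List[Dict[str, str]]:
    """Return validation tasks whose bug category appears in no training example."""
    return [task for task in validation if task["category"] not in train_categories]


# Hypothetical data: real task IDs and category names are not given in this post.
train_examples = [
    {"task_id": "t01", "category": "off_by_one"},
    {"task_id": "t02", "category": "wrong_operator"},
]
validation = [
    {"task_id": "v01", "category": "off_by_one"},
    {"task_id": "v02", "category": "mutable_default"},
    {"task_id": "v03", "category": "wrong_operator"},
]

train_categories = {example["category"] for example in train_examples}
held_out = held_out_tasks(validation, train_categories)
print([task["task_id"] for task in held_out])  # ['v02']
```

For the comparison to be fair across conditions, the training category set would need to be the union over every condition's training examples, matching the "for any condition" wording above.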

Per-condition numbers

| Stage | Static prompt | Scalar-feedback prompt | Localized-feedback prompt |
|---|---|---|---|
| Verified-correct patches per run (out of 128) | 0 | 22 to 26 | 23 to 33 |
| Unique bug tasks solved during mining | 0 | 8 to 9 | 12 to 14 |
| Unique bug categories covered | 0 | 3 | 4 to 5 |
| Validation repair success after LoRA (mean of 3 samples) | 0 / 38 | 15 / 38 | 22 / 38 |
| Held-out-category repair success (mean of 3 samples) | 0 / 6 | 0.33 / 6 | 3.67 / 6 |

Bounds

One intervention on one model. I tested one open-weight family at one size (1.5B parameters, instruction-tuned), trained on one synthetic Python code-repair benchmark with hidden pytest verification, and evaluated on a 38-task validation suite with 6 items in the toughest held-out-category split. Any individual number could move with a different random sample. The held-out tasks belong to bug categories the LoRA never saw, but they share the synthetic benchmark's task grammar, file structure, and hidden-test style. The claim is cross-category transfer inside a synthetic repair distribution, not broad code-repair generalization. I have not shown the result transfers to real codebases, larger models, multi-file bugs, or anything that is not code.

The most important baseline I have not run: take the same ten mined examples and use them as few-shot in the prompt, without LoRA. If few-shot prompting reaches similar held-out performance, the LoRA is mostly compression and deployment convenience. If LoRA beats few-shot under equal context cost, the adapter story is stronger. That comparison is open.
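
For anyone who wants to run that comparison, the few-shot side could be assembled roughly like this. It is a sketch of an experiment that has not been run: the prompt layout, section markers, and function name are assumptions, and it does not attempt to equalize context cost.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], buggy_source: str) -> str:
    """Concatenate (buggy, fixed) pairs ahead of the new task; layout is hypothetical."""
    parts = []
    for i, (buggy, fixed) in enumerate(examples, start=1):
        parts.append(f"### Example {i}\nBuggy:\n{buggy}\nFixed:\n{fixed}\n")
    parts.append(f"### Task\nBuggy:\n{buggy_source}\nFixed:\n")
    return "\n".join(parts)


# Hypothetical mined pair; the real baseline would use the same ten mined fixes.
mined = [("def last(xs):\n    return xs[0]", "def last(xs):\n    return xs[-1]")]
prompt = build_few_shot_prompt(mined, "def first(xs):\n    return xs[1]")
print(prompt)
```

The outputs of such a prompt would go through the same hidden-test verifier, so the two approaches are scored on identical terms.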

Two things I stand behind. The direction of the result reproduced across three random training samples and a fresh seed/split mining package. The methodology, including the adapter overfit gate that caught a training-format bug before the matched-pool comparison ran, is reconstructible from the source artifacts.

Close

The conventional move when fine-tuning fails is to gather more training data. On one task surface, the bottleneck was example visibility instead. The information needed to teach a small model a generalizable skill sat inside the model already. The everyday wording of the request was not reaching it. A better wording reached it. The automated grader confirmed which answers were real. Ten of those answers were enough to retrain the model into something usable on bugs it had never seen.

If you are fine-tuning a small open-weight model for a verifier-friendly task and you cannot mine a single training example from the deployment prompt, the deployment prompt may not be the only question to ask. Try the others before commissioning the corpus.