Hidden activations know when self-report does not
A classifier on Llama 3.1 8B's internal activations recovers claim-level correctness signal that the model's own confidence answers miss across legal QA and biography factuality.
Llama 3.1 8B misses unreliable claims in its own long-form answers when you ask it directly. A classifier trained on the model's internal activations does much better. On MAUD merger-agreement QA, asking the model reaches AUROC 0.511 (chance is 0.5); the classifier reaches 0.771, a paired bootstrap gain of +0.260 [0.137, 0.387]. On FActScore biographies under human atomic-fact labels, the gap reproduces: self-report 0.541, classifier 0.802, paired delta +0.262 [0.219, 0.302].
The narrow claim: hidden activations carry recoverable claim-level correctness signal that the model's verbal confidence does not surface. The effect size matches across two settings with different label sources, generation sources, and content domains. FELM points the same way but remains statistically inconclusive; its matched-generation repair shows reference-bounded labels collapsing into evidence-coverage measurements once generated claims exceed the reference scope.
Self-report misses local failure
Long answers fail claim by claim. One sentence is right, the next is partly right, the next is confidently wrong. A user who depends on the answer needs to know which is which. A single answer-level confidence number cannot do that job.
The simplest claim-level baseline is to ask the model. Show Llama 3.1 8B a fixed claim from its own answer and ask for structured confidence. If self-report ranks correct claims above incorrect ones, a probe-based system is wasted complexity. When self-report fails, the alternative has to come from somewhere else inside the same forward pass.
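A minimal sketch of what a structured self-report query could look like. The prompt wording, the JSON schema, and the names `PROMPT_TEMPLATE`, `build_prompt`, and `parse_confidence` are illustrative assumptions; the exact prompt used in these runs is not given here.

```python
import json

# Hypothetical prompt wording; not the prompt used in the runs described above.
PROMPT_TEMPLATE = (
    "Here is a claim taken from your earlier answer:\n"
    "{claim}\n\n"
    'Reply with JSON only: {{"confidence": <number between 0 and 1>}}'
)


def build_prompt(claim: str) -> str:
    """Fill the fixed claim text into the (assumed) template."""
    return PROMPT_TEMPLATE.format(claim=claim)


def parse_confidence(reply: str, default: float = 0.5) -> float:
    """Extract a confidence in [0, 1] from a JSON reply, else return `default`."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return default
    try:
        value = float(json.loads(reply[start:end + 1])["confidence"])
    except (ValueError, KeyError, TypeError):
        return default
    # Clamp so downstream AUROC and Brier computations see a valid probability.
    return min(1.0, max(0.0, value))


example_prompt = build_prompt("The agreement includes a fiduciary out.")
example_score = parse_confidence('Sure: {"confidence": 0.8}')
```

Falling back to 0.5 on unparseable replies is one choice among several; dropping those claims instead would change which rows enter the comparison.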
Llama's structured self-report fails to rank claim correctness in these runs. A linear readout of its hidden activations on the same claims recovers that signal, with a large gap.
The probe stays simple: layer-19 residual-stream vectors pooled over the claim span, standardized, fed to L2 logistic regression. No attention head selection, no SAE feature search, no learned pooling. I wanted to test the basic geometric assumption before reaching for anything more expressive.
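The recipe above (pool over the claim span, standardize, fit L2 logistic regression) can be sketched in dependency-free Python. This is an illustration under stated assumptions, not the code behind the reported numbers: the data is synthetic, 4 dimensions stand in for the 4096-dimensional residual stream, mean pooling is assumed as the pooling rule, and plain gradient descent replaces whatever solver was actually used. In practice a library implementation such as scikit-learn's `LogisticRegression` would be the natural choice.

```python
import math
import random


def mean_pool(token_vectors, start, end):
    """Average the token vectors in [start, end) into one claim vector."""
    span = token_vectors[start:end]
    return [sum(vec[d] for vec in span) / len(span) for d in range(len(span[0]))]


def fit_standardizer(rows):
    """Per-dimension mean and standard deviation, estimated on training rows."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dim)]
    stds = []
    for d in range(dim):
        var = sum((r[d] - means[d]) ** 2 for r in rows) / n
        stds.append(math.sqrt(var) or 1.0)  # guard against constant dimensions
    return means, stds


def standardize(row, means, stds):
    return [(x - m) / s for x, m, s in zip(row, means, stds)]


def sigmoid(z):
    # Numerically stable for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logreg_l2(rows, labels, l2=1.0, lr=0.1, steps=500):
    """Batch gradient descent on mean log-loss plus (l2 / 2n) * ||w||^2."""
    n, dim = len(rows), len(rows[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_b += err
            for d in range(dim):
                grad_w[d] += err * x[d]
        w = [wi - lr * (g + l2 * wi) / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(row, w, b):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b)


# Synthetic stand-in data: dimension 0 carries the label, the rest is noise.
rng = random.Random(0)
claims, labels = [], []
for i in range(40):
    y = i % 2
    shift = 2.0 * y - 1.0
    tokens = [[rng.gauss(shift if d == 0 else 0.0, 1.0) for d in range(4)]
              for _ in range(6)]
    claims.append(mean_pool(tokens, 1, 5))  # claim span = tokens 1..4
    labels.append(y)

means, stds = fit_standardizer(claims)
X = [standardize(r, means, stds) for r in claims]
w, b = fit_logreg_l2(X, labels)
# Training accuracy only; a real evaluation scores held-out claims.
train_acc = sum((predict_proba(x, w, b) > 0.5) == bool(y)
                for x, y in zip(X, labels)) / len(labels)
```

The standardizer is fit on training rows only and then reused at scoring time, so held-out claims never influence the scaling.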
Setup: one probed model, three label regimes
The probed model is Llama 3.1 8B Instruct throughout. The dataset and label source change. MAUD provides 150 fixed legal claims scored by two LLM judges. FActScore provides 4,886 human-labeled atomic facts mapped to parent biography sentences. FELM provides human segment labels on world-knowledge QA, where the matched-generation repair exposed an annotation-target problem.
The two protocols differ because the sample sizes differ. MAUD's 150 claims do not support a clean train/validation/test split, so the headline MAUD probe uses leave-one-out evaluation over the frozen full set. FActScore's 4,886 facts support a 70/15/15 split by biography, with 3,353 train, 791 validation, and 742 test claims. I compare the qualitative gap, not identical pipelines.
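The two evaluation protocols can be sketched as index generators. This is a hedged illustration: `split_by_group` and `leave_one_out` are hypothetical helper names, the seed and rounding rule are assumptions, and the toy group ids below are not the FActScore biographies.

```python
import random
from collections import defaultdict


def split_by_group(group_ids, fractions=(0.70, 0.15, 0.15), seed=0):
    """Assign whole groups (e.g. biographies) to train/val/test; return row indices."""
    groups = sorted(set(group_ids))
    random.Random(seed).shuffle(groups)
    n_train = round(fractions[0] * len(groups))
    n_val = round(fractions[1] * len(groups))
    bucket = {}
    for i, g in enumerate(groups):
        if i < n_train:
            bucket[g] = "train"
        elif i < n_train + n_val:
            bucket[g] = "val"
        else:
            bucket[g] = "test"
    split = defaultdict(list)
    for row, g in enumerate(group_ids):
        split[bucket[g]].append(row)
    return dict(split)


def leave_one_out(n_rows):
    """Yield (train_indices, held_out_index), one pair per row."""
    for held_out in range(n_rows):
        yield [i for i in range(n_rows) if i != held_out], held_out


# Toy data: 20 groups of 5 rows each.
group_ids = [f"bio_{i // 5}" for i in range(100)]
split = split_by_group(group_ids)
folds = list(leave_one_out(5))
```

Splitting by group keeps every fact from one biography on the same side of the split, so the row-level proportions only approximate 70/15/15 when groups differ in size. Under leave-one-out, any scaling and probe fitting would be redone inside each fold so the held-out claim never touches training.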
Same gap, two domains
The simplest evaluation is whether each score ranks correct claims above non-correct claims. AUROC is 0.5 at chance. Higher is better. Brier score on the same scores measures how well the probability magnitudes match the binary label.
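Both metrics are short enough to write out. The sketch below uses toy scores, not data from these runs. AUROC is computed by direct pair counting, which is fine at these sample sizes, and the last line shows the reference point for Brier: a constant prediction of 0.5 scores 0.25 whatever the labels are.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(probs, labels):
    """Mean squared error between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


# Toy example: three correct claims (1) and two non-correct claims (0).
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.1]

toy_auroc = auroc(scores, labels)                 # 5 of 6 pairs ranked correctly
toy_brier = brier(scores, labels)
constant_brier = brier([0.5] * len(labels), labels)  # always 0.25
```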
| Dataset | Label source | Self-report | Residual probe | Paired Δ |
|---|---|---|---|---|
| MAUD | GPT-5.4 judge proxy | 0.511 [0.411, 0.597] | 0.771 [0.687, 0.838] | +0.260 [0.137, 0.387] |
| FActScore | Human atomic facts | 0.541 [0.514, 0.570] | 0.802 [0.769, 0.833] | +0.262 [0.219, 0.302] |
| FELM-wk | Human segments | 0.511 [0.393, 0.673] | 0.652 [0.493, 0.815] | +0.141 [-0.063, 0.320] |
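A paired bootstrap for the Δ column can be sketched as follows. The resampling details here are assumptions, since they are not specified above: a percentile interval, claims resampled with replacement, the same resampled indices applied to both scorers, and single-class resamples skipped. The scores are synthetic.

```python
import random


def auroc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_delta(scores_a, scores_b, labels,
                           n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile CI for AUROC(b) - AUROC(a)."""
    rng = random.Random(seed)
    n = len(labels)
    point = auroc(scores_b, labels) - auroc(scores_a, labels)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUROC is undefined without both classes
        # Same indices for both scorers: this is what makes the test paired.
        a = auroc([scores_a[i] for i in idx], ys)
        b = auroc([scores_b[i] for i in idx], ys)
        deltas.append(b - a)
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi


# Synthetic demo: scorer A is noise, scorer B tracks the label.
rng = random.Random(1)
labels = [i % 2 for i in range(60)]
scores_a = [rng.random() for _ in labels]
scores_b = [y + rng.gauss(0.0, 0.3) for y in labels]
point, lo, hi = paired_bootstrap_delta(scores_a, scores_b, labels, n_boot=500)
```

Pairing matters because both scorers are evaluated on the same claims: hard claims hurt both at once, so the interval on the difference can be tighter than the two marginal intervals suggest.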
Replication scope
MAUD uses leave-one-out over a small judge-labeled legal claim set. FActScore uses a 70/15/15 split by biography over a much larger human-labeled biography set. I do not force 150 rows to imitate 4,886. The replicated object is the qualitative gap between self-report and residual activations, with paired effect sizes of +0.260 and +0.262 AUROC. Treat the agreement as evidence about the signal rather than the pipeline.
Second judge sensitivity
MAUD's first labels come from a GPT-5.4 judge-proxy protocol. Probes trained on those labels could be fitting whatever GPT-5.4 happens to score consistently rather than something more general. The cleanest test is to relabel the same fixed claims with a different judge family and rescore the same probes without retraining. I use Kimi K2.6 through Prime Intellect Inference, with the prompt structure and claim serialization preserved.
| Method | GPT-5.4 AUROC | Kimi AUROC | Agreement-set AUROC | Agreement Brier |
|---|---|---|---|---|
| Llama self-report | 0.511 [0.411, 0.597] | 0.466 [0.413, 0.604] | 0.486 [0.392, 0.598] | 0.509 [0.400, 0.600] |
| GPT-5.4 scorer | 0.944 [0.906, 0.984] | 0.872 [0.807, 0.926] | 0.981 [0.931, 1.000] | 0.101 [0.064, 0.144] |
| Residual probe | 0.771 [0.687, 0.838] | 0.707 [0.627, 0.792] | 0.793 [0.709, 0.870] | 0.231 [0.166, 0.299] |
| SAE probe | 0.677 [0.582, 0.763] | 0.652 [0.563, 0.735] | 0.712 [0.616, 0.812] | 0.268 [0.193, 0.350] |
Judge disagreement
The two judges agree on 110 of 150 claims. Cohen's kappa is 0.572 [0.456, 0.682]. Disagreement clusters around the rubric boundary. Stratifying by GPT-5.4's label shows the hard cases.
| GPT-5.4 label | n claims | Kimi agrees | Kimi disagrees | Agreement rate |
|---|---|---|---|---|
| true | 62 | 54 | 8 | 0.871 |
| partially_true | 68 | 41 | 27 | 0.603 |
| false | 20 | 15 | 5 | 0.750 |
| full set | 150 | 110 | 40 | 0.733 |
The judges agree most on true claims, less on false claims, and least on partially-true claims. That boundary drives kappa down toward 0.57. The 110-claim agreement subset gives the cleanest MAUD slice because it removes labels where two judge families disagree.
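The agreement rates in the table follow directly from its counts, and Cohen's kappa and the agreement subset are a few lines each. In the sketch below the table counts are taken from the text, while the four-claim label lists are a toy example only; the full judge-by-judge confusion matrix needed to reproduce kappa 0.572 is not given here.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def agreement_indices(labels_a, labels_b):
    """Row indices where both judges assign the same label."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a == b]


# Counts from the table: GPT-5.4 label -> (Kimi agrees, Kimi disagrees).
table = {"true": (54, 8), "partially_true": (41, 27), "false": (15, 5)}
rates = {k: agree / (agree + disagree) for k, (agree, disagree) in table.items()}
total_agree = sum(a for a, _ in table.values())
total = sum(a + d for a, d in table.values())
overall_rate = total_agree / total

# Toy labels, for illustration only.
judge_1 = ["true", "true", "false", "false"]
judge_2 = ["true", "false", "false", "false"]
toy_kappa = cohen_kappa(judge_1, judge_2)
toy_subset = agreement_indices(judge_1, judge_2)
```

Rescoring on the agreement subset then means evaluating the already-trained probe's scores on `agreement_indices(...)` only, with no refitting.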
On the 110-claim agreement set, every strong method gets cleaner: GPT-5.4 scoring rises to 0.981, the residual probe rises to 0.793, the SAE probe rises to 0.712. A probe fitting GPT-5.4's quirks should not improve when evaluation keeps only claims another judge family accepts. The improvement fits the interpretation that probes track signal surviving judge disagreement.
FActScore probe diagnostics
FActScore changes everything except the probed model: human atomic-fact labels instead of LLM judges, ChatGPT generations instead of Llama generations, biography factuality instead of merger-agreement QA, atomic facts pooled to parent sentence spans instead of extracted legal claims. Despite all of that, the residual probe's held-out AUROC lands at 0.802. The diagnostic question is whether that test number reflects real generalization or a lucky validation set.
The Brier scores tell the same story without the rank-only filter. Self-report Brier is 0.298, which is worse than the 0.25 that always saying 0.5 would score; the residual probe Brier is 0.171, well below that baseline. If AUROC and Brier disagreed, that would flag a rank-versus-magnitude trade-off; here they agree, and the probe wins on both.
The boundary: labels can measure coverage
FELM-wk gives the cautionary case. The minimum-viable run reaches 0.652 residual probe AUROC against 0.511 for self-report, with paired delta +0.141 [-0.063, 0.320]. The point estimate is positive. The interval crosses zero.
The useful FELM finding came from the annotation repair. Llama-generated FELM-style answers exceed FELM's curated reference snippets often enough that scoring them false against those snippets targets the wrong object. I added fuller readability-extracted references and a fourth not_enough_evidence label, then ran a repaired pilot of 52 segments.
The pilot came back with a 57.7% not_enough_evidence rate (30 of 52 segments), which exposes the target mismatch. When generated claims exceed the reference scope, a reference-bounded annotator cannot adjudicate them. The label tells me whether the bundle supports the claim, not whether the claim is true. That gives a coverage measurement, not a correctness measurement, and it is the wrong target for a probe trained on claim correctness. I did not scale the matched-generation FELM run; FELM stays in the paper as a documented boundary for label provenance.
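A coverage check of this kind is cheap to run before any probe is trained. The sketch below is illustrative: the 30-of-52 count is what a 57.7% rate on 52 segments implies, the `adjudicated` label lumps together the other three labels whose breakdown is not given here, and the 0.2 threshold and the helper names are hypothetical.

```python
from collections import Counter


def label_rates(labels):
    """Fraction of items carrying each label."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


def coverage_ok(labels, max_unadjudicable=0.2, tag="not_enough_evidence"):
    """True when few enough items fall outside the reference scope."""
    return label_rates(labels).get(tag, 0.0) <= max_unadjudicable


# 30 of 52 segments unadjudicable; the remaining 22 are lumped together here.
pilot_labels = ["not_enough_evidence"] * 30 + ["adjudicated"] * 22
pilot_rate = label_rates(pilot_labels)["not_enough_evidence"]
pilot_usable = coverage_ok(pilot_labels)
```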
The methodological lesson
Label provenance dominates the transfer story. MAUD works because fixed claims are judged by a clear rubric. FActScore works because human atomic-fact labels capture factual support. FELM is weaker because segment labels and reference coverage are coarser than the claim-level correctness target the probe needs. Claim construction and label provenance affect probe trainability as much as the probe architecture does.
A probe-based uncertainty system starts before the classifier runs. Define the claim. Name what the label captures. Align the activation span with the label unit. With those three pieces aligned, a regularized linear readout of the residual stream recovers a sizeable correctness signal. With a muddier target, a stronger probe mostly learns the mud.
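Aligning the activation span with the label unit mostly comes down to mapping a claim's character range onto token indices. A minimal sketch, assuming per-token character offsets are available (fast Hugging Face tokenizers can return these with `return_offsets_mapping=True`); whitespace tokens stand in for real tokenizer output here, and `token_span_for_claim` is a hypothetical helper name.

```python
import re


def token_span_for_claim(offsets, claim_start, claim_end):
    """Return (first, last_exclusive) token indices overlapping the character span."""
    hits = [
        i for i, (tok_start, tok_end) in enumerate(offsets)
        if tok_end > tok_start            # skip zero-width special tokens
        and tok_start < claim_end
        and tok_end > claim_start
    ]
    if not hits:
        raise ValueError("claim span does not overlap any token")
    return hits[0], hits[-1] + 1


text = "The deal closed in 2019. The buyer paid cash."
# Whitespace "tokens" as a stand-in for a real tokenizer's offset mapping.
matches = list(re.finditer(r"\S+", text))
offsets = [(m.start(), m.end()) for m in matches]
words = [m.group() for m in matches]

claim = "The buyer paid cash."
claim_start = text.index(claim)
span = token_span_for_claim(offsets, claim_start, claim_start + len(claim))
claim_tokens = words[span[0]:span[1]]
```

The returned indices are what a pooling step would slice on, so an off-by-one here silently pools activations from the neighboring claim.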
Bounds on the claim
- Claim: residual activations recover claim-level correctness signal that structured Llama self-report misses on MAUD and FActScore. The paired AUROC delta lands at +0.260 and +0.262 across the two settings.
- Limit: probes do not detect legal or factual truth in an absolute sense. MAUD remains judge-proxy evidence until the frozen 30-claim human audit is analyzed.
- Claim: the GPT-5.4 external scorer is the strongest MAUD method, but its 0.944 AUROC is inflated by same-family scorer-judge coupling. The defensible estimate is 0.872 under the independent judge.
- Limit: residual probes have not cleared SAE probes. Residual is stronger in the point estimate, but residual-minus-SAE intervals cross zero under judge 2 and on the agreement set.
- Limit: the residual correctness direction does not decompose into a small number of Goodfire layer-19 SAE features. A separate analysis (probe_interpretation.md) found the signal distributed; the probe is a useful behavioral readout rather than a compact mechanism story.
- Limit: I have not tested other model families, other layers, or other claim extractors. All probes are layer 19 of Llama 3.1 8B Instruct. CUAD failed at the claim-extraction step before this protocol could run.
Next work
Three follow-ups would most sharpen the claim, in roughly this order.
Close the human-audit loop
The 30-claim MAUD audit packet is frozen at data/annotations/maud_human_audit_packet.jsonl. Completed expert labels would convert MAUD's claim from "stable across two LLM judges" to "stable across two LLM judges, anchored to expert legal labels," or expose a systematic disagreement worth knowing about. Either outcome is publishable.
Matched-generation transfer
FActScore reduces the generation-mismatch worry while leaving some of it open. A matched-generation run on a dataset where Llama's claims stay inside the reference scope would close that gap. The FELM repair shows the failure mode to avoid: reference-bounded labels driven into not_enough_evidence.
Layer sweep and second model family
All probes are layer 19, chosen because the Goodfire SAE was trained there. A small layer sweep would document how localized the signal is. A second model family at 7-13B parameters would say whether the gap is a Llama representation quirk or a more general property of instruction-tuned residual streams.
The result is a bounded interpretability win. A linear readout of hidden activations sees something the model's own structured confidence misses, and the gap is the same size on legal QA and on biography factuality. I do not claim a compact SAE feature story, a legal-truth adjudicator, or a domain-general hallucination detector.
The remaining questions sit on label provenance and matched-generation transfer, not on probe architecture. Downstream tools building on this should expose claim construction and label-provenance assumptions to users rather than hiding them behind opaque confidence scores. The probe is only as honest as the labels it was trained on.