[P] Built confidence scoring for autoresearch because keeps that don't reproduce are worse than discards

Been running autoresearch for about a week, ~100 experiments per night on an H100. The keep rate is around 15%, which matches what Karpathy posted in his own discussion threads. The problem isn't the keep/discard loop. That works. The problem is that some of those keeps don't hold up. Karpathy's own session shows that 5% warmup (a keep in an earlier session) actually hurt performance when run again. A 0.02% improvement in val_bpb could be a real win, or it could just be GPU nondeterminism. Over extended runs the keep rate gets worse too: 68 experiments for a single keep.
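To make the "real win or GPU nondeterminism" question concrete, here is a minimal sketch of one way to frame it: rerun the unchanged baseline a few times, treat the spread in val_bpb as the noise floor, and express a candidate's improvement in units of that spread. The function names, the 2-sigma threshold, and the val_bpb numbers are all hypothetical and only for illustration; this is not necessarily how the scoring in this project works.

```python
import statistics

def confidence(baseline_bpbs, candidate_bpb):
    """Improvement over the baseline mean, in units of the baseline's
    run-to-run standard deviation. Lower val_bpb is better, so a
    positive score means the candidate beat the baseline mean."""
    mean = statistics.mean(baseline_bpbs)
    sigma = statistics.stdev(baseline_bpbs)  # sample std dev, needs >= 2 runs
    improvement = mean - candidate_bpb
    if sigma == 0:
        # Perfectly deterministic baseline: any improvement is real.
        return float("inf") if improvement > 0 else 0.0
    return improvement / sigma

def classify(score, threshold=2.0):
    """Hypothetical cutoff: treat anything under `threshold` sigma as noise."""
    return "likely real" if score >= threshold else "within noise"

# Hypothetical val_bpb values from rerunning the same baseline config.
baseline = [0.9980, 0.9984, 0.9978, 0.9982]

# A candidate roughly 0.02% better than the baseline mean.
small_win = 0.9979
# A candidate roughly 0.3% better than the baseline mean.
big_win = 0.9950

print(classify(confidence(baseline, small_win)))
print(classify(confidence(baseline, big_win)))
```

With a baseline spread like this, a 0.02% gain sits well inside the noise, while a gain ten times larger clears the threshold easily. The cost is extra baseline reruns up front, which is the trade-off for not promoting keeps that fail to reproduce.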