Can models gradient hack SFT elicitation?

TL;DR: Using evidence from tamper resistance, we argue that it would be hard for current models to gradient hack SFT elicitation.

Suppose you want to conservatively measure the dangerous capabilities of your LLM and will decide whether to deploy it based on these capabilities, but you’re worried that the LLM might sandbag your evaluation. In the best case, you have a dataset of question-answer pairs, which you can use for supervised fine-tuning (SFT) elicitation on a