Beyond Red-Teaming: Formal Guarantees of LLM Guardrail Classifiers

arXiv:2605.10901v1 Announce Type: new Guardrail classifiers defend production language models against harmful behavior, but although their results look promising in testing, they provide no formal guarantees. Providing formal guarantees for such models is hard because "harmful behavior" has no natural specification in a discrete input space, and the standard epsilon-ball properties used in other domains do not carry semantic meaning.