Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems

arXiv:2604.22154v1 Announce Type: cross Emerging AI systems in behavioral health and psychiatry use multi-step or multi-agent LLM pipelines for tasks such as assessing self-harm risk and screening for depression. However, common evaluation approaches, such as LLM-as-a-judge, do not indicate when a decision is reliable or how errors may accumulate across multiple LLM judgements, limiting their suitability for safety-critical settings.