SycoFact 4B - Open model for detecting sycophancy & confirmation of delusions, 100% on psychosis-bench, generates feedback for model training, trained without human labels
r/LocalLLaMA
•
Machine Learning
Open Source AI
I published a model you can use now to help detect sycophantic AI responses. It rejects 100% of the sycophantic delusion affirming responses from psychosis-bench. It also does well on the AISI Harmful Advice, PKU-SafeRLHF, and safety subsets of RewardBench. It's only 4B parameters, so it's of particular use for