What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data

arXiv:2510.26202v2 Announce Type: replace-cross Human feedback can alter language models in unpredictable and undesirable ways, as practitioners lack a clear understanding of what feedback data encodes. While prior work studies preferences over certain attributes (e.g., length or sycophancy), automatically extracting relevant features without pre-specifying hypotheses remains challenging. We