Beyond Accuracy: LLM Variability in Evidence Screening for Software Engineering SLRs

arXiv:2604.27006v1 Announce Type: cross Context: Study screening in systematic literature reviews (SLRs) is costly, prone to inconsistency, and risk-asymmetric, since false negatives can compromise a review's validity. Despite the rapid uptake of Large Language Models (LLMs), there is limited evidence on how such models behave during the study screening phase, particularly regarding the choice of specific LLMs and how they compare with classical models.