Token Cleaning: Fine-Grained Data Selection for LLM Supervised Fine-Tuning

arXiv:2502.01968v3 Announce Type: replace-cross Recent studies show that in supervised fine-tuning (SFT) of large language models (LLMs), data quality matters more than quantity. While most data cleaning methods concentrate on filtering entire samples, the quality of individual tokens within a sample can vary significantly. After pre-