Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

ArXi:2604.24819v1 Announce Type: cross Reliably transferring specialized human knowledge from text into large language models remains a fundamental challenge in artificial intelligence. Fine-tuning on domain corpora has enabled substantial capability gains, but the process operates without feedback: when a model fails on a domain task, there is no method to diagnose what is deficient in the