Heterogeneity in Formal Linguistic Competence of Language Models: Is Data the Real Bottleneck?

arXiv:2604.17930v1 Announce Type: cross

Large Language Models (LLMs) exhibit a puzzling disparity in their formal linguistic competence: while they on trillions of tokens. In this work, we investigate whether these failures stem from inherent architectural limitations or simply from the scarcity of these specific grammatical constructions in web-scale corpora. We pre-train simple GPT-2 Small (124M) models on a 100M-token random sample of the FineWeb corpus and intervene by injecting a minimal amount (1%) of synthetic data targeting specific linguistic phenomena.
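To make the intervention concrete, the following is a minimal sketch of how a 1% synthetic share could be mixed into a fixed token budget. It is an illustration under stated assumptions, not the authors' pipeline: the function name `inject_synthetic` is hypothetical, whitespace splitting stands in for a real tokenizer, and the sketch assumes the synthetic text replaces base text so the total budget stays constant (the abstract does not say whether the 1% is added on top or swapped in).

```python
import random


def inject_synthetic(base_docs, synthetic_docs, fraction=0.01, seed=0):
    """Mix synthetic documents into a corpus at a target token fraction.

    Hypothetical helper. Assumption: the total token budget is held fixed,
    so base documents are dropped to make room for the synthetic ones.
    Returns (mixed_docs, synthetic_token_count, base_token_count).
    """
    rng = random.Random(seed)

    def count(doc):
        # Whitespace tokens as a stand-in for a real subword tokenizer.
        return len(doc.split())

    total = sum(count(d) for d in base_docs)
    budget = int(total * fraction)  # tokens reserved for synthetic data

    # Greedily pick shuffled synthetic documents that fit in the budget.
    pool = list(synthetic_docs)
    rng.shuffle(pool)
    chosen, used = [], 0
    for doc in pool:
        n = count(doc)
        if used + n <= budget:
            chosen.append(doc)
            used += n

    # Keep shuffled base documents until the remaining budget is filled.
    base = list(base_docs)
    rng.shuffle(base)
    kept, kept_tokens = [], 0
    for doc in base:
        n = count(doc)
        if kept_tokens + n <= total - used:
            kept.append(doc)
            kept_tokens += n

    mixed = kept + chosen
    rng.shuffle(mixed)  # interleave so synthetic text is spread over training
    return mixed, used, kept_tokens


if __name__ == "__main__":
    base = ["a b c d e f g h i j"] * 100   # toy corpus: 1,000 tokens
    synthetic = ["x y"] * 20               # toy synthetic pool: 40 tokens
    mixed, n_syn, n_base = inject_synthetic(base, synthetic)
    print(len(mixed), n_syn, n_base, n_syn / (n_syn + n_base))
```

At the scale described in the abstract, the same proportion would correspond to roughly 1M synthetic tokens within the 100M-token sample; holding the total budget fixed is one way to keep comparisons with an unmodified baseline from being confounded by extra training data.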