Common TF-IDF variants arise as key components in the test statistic of a penalized likelihood-ratio test for word burstiness

arXiv:2604.00672v1 Announce Type: new TF-IDF is a classical formula that is widely used for identifying important terms within documents. We show that TF-IDF-like scores arise naturally from the test statistic of a penalized likelihood-ratio test that captures word burstiness (also known as word over-dispersion). In our framework, the alternative hypothesis captures word burstiness by modeling a collection of documents with a family of beta-binomial distributions, with a gamma penalty term on the precision parameter.
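
To make the beta-binomial alternative concrete, the following is a minimal Python sketch rather than the paper's actual procedure. It assumes a mean/precision parametrization (alpha = mean * precision, beta = (1 - mean) * precision), a plain binomial null hypothesis, equal document lengths, and a fixed precision value; none of these details are given in the abstract, and the gamma penalty term is omitted entirely. All function names are hypothetical.

```python
import math


def log_binom_coeff(n, k):
    """Log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def binomial_logpmf(k, n, p):
    """Log-probability of k occurrences in n tokens at a fixed rate p."""
    return log_binom_coeff(n, k) + k * math.log(p) + (n - k) * math.log1p(-p)


def betabinom_logpmf(k, n, mean, precision):
    """Beta-binomial log-probability in an assumed mean/precision form.

    A small precision means strong over-dispersion (burstiness); a large
    precision approaches the ordinary binomial.
    """
    a = mean * precision
    b = (1.0 - mean) * precision
    return log_binom_coeff(n, k) + log_beta(k + a, n - k + b) - log_beta(a, b)


def log_likelihood_ratio(counts, doc_len, precision):
    """Unpenalized log-likelihood ratio: beta-binomial versus binomial.

    Assumption: every document has the same length, and both models share
    the pooled word rate as their mean.
    """
    mean = sum(counts) / (doc_len * len(counts))
    null = sum(binomial_logpmf(k, doc_len, mean) for k in counts)
    alt = sum(betabinom_logpmf(k, doc_len, mean, precision) for k in counts)
    return alt - null


# Same total count (20 occurrences in 5 documents of 100 tokens each),
# distributed in a bursty way versus an even way.
bursty_llr = log_likelihood_ratio([0, 0, 0, 0, 20], doc_len=100, precision=1.0)
even_llr = log_likelihood_ratio([4, 4, 4, 4, 4], doc_len=100, precision=1.0)
print(f"bursty: {bursty_llr:.2f}, even: {even_llr:.2f}")
```

With these toy counts the ratio is positive for the bursty word and negative for the evenly spread one, which is the kind of contrast a burstiness test is meant to detect. How the penalized version of such a statistic yields TF-IDF-like scores is the subject of the paper and is not reproduced here.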