[R] KALAVAI: Predicting When Independent Specialist Fusion Works (gain = 0.82 × divergence − 2.72, R² = 0.856, tested 410M–6.9B)

Hey all, I've been working on this for a few months and just put the paper on arXiv, along with a project page and the code + scripts.

The basic idea: take a base checkpoint and give copies to a bunch of people. Each person fine-tunes on their own domain or language independently (no communication, no shared gradients, nothing). Then you collect all the checkpoints and train a lightweight MoE router on top, which takes about 500 steps. The fused model beats every individual specialist. I tested this at 410M, 1B, and 6.9B on Pythia.
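If you want to poke at the fit from the title (gain = 0.82 × divergence − 2.72), here's a minimal sketch that just plugs numbers into it. The coefficients are copied from the title; the function names and the example divergence values are made up for illustration, and this post doesn't spell out the units or how divergence is measured, so treat the outputs as whatever units the paper uses. The fit was only tested at 410M–6.9B, so anything outside that range is extrapolation.

```python
# Coefficients copied from the linear fit reported in the title.
# Units of gain and divergence are NOT specified here (assumption: same as the paper's).
SLOPE = 0.82
INTERCEPT = -2.72


def predicted_gain(divergence: float) -> float:
    """Fusion gain predicted by the reported linear fit (hypothetical helper)."""
    return SLOPE * divergence + INTERCEPT


def break_even_divergence() -> float:
    """Divergence at which the fit predicts zero gain (hypothetical helper)."""
    return -INTERCEPT / SLOPE


if __name__ == "__main__":
    # Hypothetical divergence values, chosen only to show the shape of the fit.
    for d in (2.0, 5.0, 10.0):
        print(f"divergence={d:5.1f} -> predicted gain={predicted_gain(d):+.2f}")
    print(f"break-even divergence ~ {break_even_divergence():.2f}")
```

The break-even point is the useful bit: under this fit, specialists that diverge less than roughly 2.72 / 0.82 from each other are predicted to gain nothing from fusion. With R² = 0.856 there is still real scatter around the line, so it's a rough guide rather than a hard threshold.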