Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization [R]

Paper:

I found this interesting as another iteration of the TRM approach. It shows decent OOD generalization on 2 of the 3 tasks (but why does it fail past 2x, and why is unstructured text so much worse?). It also explains why intermediate-step supervision can hurt generalization.