Llama 8B matching 70B on multi-hop QA with structured prompting, no fine-tuning

Ran a bunch of experiments with Graph RAG (KET-RAG) on multi hop question answering. Turns out retrieval is basically solved, the answer is in the context 77 to 91% of the time. The bottleneck is reasoning: 73 to 84% of wrong answers come from the model failing to connect the dots, not from missing information. Smaller models choke on the reasoning even when the answer is sitting right there in the context.