Which LLMs actually fail when domain knowledge is buried in long documents?

I’ve been testing whether frontier LLMs can retrieve expert industrial knowledge (sensor-failure relationships from ISO standards) when the relevant information is buried inside long documents. The interesting pattern so far: DeepSeek V3.2 answers the questions correctly in isolation but fails when the same questions are embedded in a long context. Gemma 3 27B fails on the domain knowledge itself, whether or not the long context is present.
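
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two conditions could be set up and scored. Everything in it is an assumption for illustration: the function names `embed_fact` and `classify`, the idea of placing the relevant passage at a relative depth among filler paragraphs, and the failure labels are all hypothetical and not the actual harness behind the results above. It does not call any model; it only builds the long-context document and sorts a pair of outcomes into a failure mode.

```python
def embed_fact(fact: str, filler: list[str], depth: float) -> str:
    """Place `fact` among filler paragraphs at a relative depth in [0, 1].

    depth=0.0 puts the fact first, depth=1.0 puts it last.
    (Hypothetical helper; the depth scheme is an assumption.)
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    index = int(depth * len(filler))
    paragraphs = filler[:index] + [fact] + filler[index:]
    return "\n\n".join(paragraphs)


def classify(correct_isolated: bool, correct_in_context: bool) -> str:
    """Sort one question's two outcomes into a failure mode.

    The labels are illustrative names, not established terminology.
    """
    if correct_isolated and correct_in_context:
        return "pass"
    if correct_isolated:
        # Right on its own, wrong once buried: the pattern described
        # above for DeepSeek V3.2.
        return "long-context failure"
    if correct_in_context:
        # Wrong on its own, right with the document present.
        return "context-aided"
    # Wrong in both conditions: the pattern described above for Gemma 3 27B.
    return "knowledge failure"


if __name__ == "__main__":
    filler = ["filler paragraph A", "filler paragraph B",
              "filler paragraph C", "filler paragraph D"]
    doc = embed_fact("RELEVANT PASSAGE", filler, depth=0.5)
    print(doc)
    print(classify(correct_isolated=True, correct_in_context=False))
    print(classify(correct_isolated=False, correct_in_context=False))
```

Scoring each question under both conditions, rather than only in the long-context condition, is what lets the two failure types be told apart.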