LLM Thematic Generalization Benchmark V2: models see 3 examples, 3 misleading anti-examples, and 8 candidates with exactly 1 true match, but the underlying theme is never stated. The challenge is to infer the specific hidden rule from those clues rather than fall for a broader, easier pattern.

r/singularity

Info: Example benchmark item:

Examples:
- a surveyor's leveling rod
- a fishpole microboom
- a submarine periscope housing

Anti-examples:
- a coiled steel measuring tape
- a folding wooden carpenter's rule
- a retractable cord dog leash

Correct candidate:
- a collapsible stainless steel drinking straw

Incorrect candidates:
- a screw-type automobile jack
- a folding aluminum step ladder
- a kaleidoscope viewing tube
- a pair of hinge-folding opera glasses
- a flexible silicone drinking straw
- a drawer glide rail mechanism
- a cardboard box periscope

Theme:
- physical objects that extend.
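
For anyone who wants to poke at the format, here is a rough Python sketch of how one item could be represented and checked. This is not the benchmark's actual code: the class name `BenchmarkItem`, the helpers `build_prompt` and `is_correct`, and the "pick one numbered candidate" scoring are all assumptions made for illustration, since the post does not say how prompts are built or how answers are scored. The only things taken from the post are the item contents and the constraint that the theme stays hidden and exactly 1 of the 8 candidates matches.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkItem:
    """One item: 3 examples, 3 anti-examples, 1 correct + 7 incorrect candidates."""
    examples: tuple[str, ...]
    anti_examples: tuple[str, ...]
    correct: str
    distractors: tuple[str, ...]
    theme: str  # never shown to the model; kept only for reference

    def candidates(self, seed: int = 0) -> list[str]:
        # Shuffle deterministically so the correct answer has no fixed position.
        pool = [self.correct, *self.distractors]
        random.Random(seed).shuffle(pool)
        return pool


def build_prompt(item: BenchmarkItem, seed: int = 0) -> str:
    """Hypothetical prompt layout; the theme is deliberately left out."""
    lines = ["Examples:"]
    lines += [f"- {e}" for e in item.examples]
    lines.append("Anti-examples:")
    lines += [f"- {a}" for a in item.anti_examples]
    lines.append("Candidates:")
    lines += [f"{i}. {c}" for i, c in enumerate(item.candidates(seed), start=1)]
    return "\n".join(lines)


def is_correct(item: BenchmarkItem, chosen_number: int, seed: int = 0) -> bool:
    """Assumed scoring: the model answers with one candidate number (1-based)."""
    return item.candidates(seed)[chosen_number - 1] == item.correct


ITEM = BenchmarkItem(
    examples=(
        "a surveyor's leveling rod",
        "a fishpole microboom",
        "a submarine periscope housing",
    ),
    anti_examples=(
        "a coiled steel measuring tape",
        "a folding wooden carpenter's rule",
        "a retractable cord dog leash",
    ),
    correct="a collapsible stainless steel drinking straw",
    distractors=(
        "a screw-type automobile jack",
        "a folding aluminum step ladder",
        "a kaleidoscope viewing tube",
        "a pair of hinge-folding opera glasses",
        "a flexible silicone drinking straw",
        "a drawer glide rail mechanism",
        "a cardboard box periscope",
    ),
    theme="physical objects that extend",
)

if __name__ == "__main__":
    print(build_prompt(ITEM))
```

Under that assumed scoring, random guessing gets 1 in 8 right, so anything a model scores above 12.5% on a large set of items would reflect actual inference of the hidden theme.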