[R] Forced Depth Consideration Reduces Type II Errors in LLM Self-Classification: Evidence from an Exploration Prompting Ablation Study - (200 trap prompts, 4 models, 8 Step-0 variants) [R]

LLM-Based task classifier tend to misroute prompts that look simple at first glance, but require deeper understanding - I call it "Type II Error" here. Setup TaskClassBench, a custom benchmark of 200 effective trap prompts (context-contradiction + disguised-correction categories) designed to create a mismatch between surface simplicity and contextual complexity. For example: S ystem context establishes a fault-tolerant ETL pipeline with retry logic, dead-letter queues, and alerting.