Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis
arXiv cs.LG
arXiv:2605.03441v1 Announce Type: cross. Large language models (LLMs) employ safety mechanisms to prevent harmful outputs, yet these defenses rely primarily on semantic pattern matching. We show that encoding harmful prompts as coherent mathematical problems, using formalisms such as set theory, formal logic, and quantum mechanics, bypasses these filters at high rates, achieving average attack success rates of 46% to 56% across eight target models and two established benchmarks.