tested 4 local models on iPhone - benchmarks + the 9.9 vs 9.11 math trick

Ran a local LLM benchmark on my iPhone Pro Max last night. Tested 4 models, all Q4-quantized, running fully on-device with no internet. First, the sanity check: I asked each one "which number is larger, 9.9 or 9.11" and all 4 got it right. The reasoning styles were pretty different, though.
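
For anyone wondering why this question counts as a "trick": read as decimals, 9.9 is larger, but if you compare the part after the dot as a whole number (the way version numbers work), 11 beats 9 and 9.11 looks larger. That's one common explanation for why models slip on it, not something I measured here. A minimal Python sketch of the two readings (the function names are made up for illustration, not from any library):

```python
from decimal import Decimal

def larger_as_decimal(a: str, b: str) -> str:
    # Compare as ordinary decimal numbers: 9.9 == 9.90, which is > 9.11
    return a if Decimal(a) > Decimal(b) else b

def larger_as_version(a: str, b: str) -> str:
    # Compare dot-separated parts as integers, like software versions:
    # (9, 9) vs (9, 11), and 11 > 9
    def key(s: str) -> tuple:
        return tuple(int(part) for part in s.split("."))
    return a if key(a) > key(b) else b

print(larger_as_decimal("9.9", "9.11"))  # 9.9
print(larger_as_version("9.9", "9.11"))  # 9.11
```

The first reading is the one the question is asking for, so "9.9" is the correct answer.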