
[P] Runtime GGUF tampering in llama.cpp: persistent output steering without server restart

r/MachineLearning

I built a small research PoC: llm-inference-tampering. It demonstrates a runtime integrity risk in local inference setups that use llama.cpp with its default mmap behavior: if another process can write to the same GGUF file, generation behavior can be persistently altered while the server is running.

High-level idea:

- llama-server maps the GGUF weights from disk.
- The PoC modifies quantization scale values in output.weight for selected token rows.
- Those tokens become disproportionately likely in the output.
- No ptrace, no process injection, and no server restart required.
- Includes a restore flow to revert the original values.
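
To make the mechanism concrete, here is a minimal toy sketch in Python. It is not the PoC's code and does not parse real GGUF. The file layout (one float32 scale followed by 8 int8 quants per row), the helper names, and the numbers are all made up for illustration. It assumes a POSIX-style shared file mapping, where ordinary writes to a file show up in an existing read-only mapping of it. The patched row is deliberately one whose unscaled logit is positive, so a larger scale raises it.

```python
import mmap
import os
import struct
import tempfile

ROWS, COLS = 4, 8          # toy "vocabulary" of 4 tokens, hidden size 8
ROW_FMT = f"<f{COLS}b"     # hypothetical layout: float32 scale + COLS int8 quants
ROW_SIZE = struct.calcsize(ROW_FMT)


def write_toy_weights(path):
    """Write ROWS rows, each with scale 1.0 and small distinct int8 quants."""
    with open(path, "wb") as f:
        for r in range(ROWS):
            quants = [(r + c) % 5 - 2 for c in range(COLS)]
            f.write(struct.pack(ROW_FMT, 1.0, *quants))


def logits(view, hidden):
    """Dequantize each row from the mapping and dot it with the hidden state."""
    out = []
    for r in range(ROWS):
        scale, *quants = struct.unpack_from(ROW_FMT, view, r * ROW_SIZE)
        out.append(scale * sum(q * h for q, h in zip(quants, hidden)))
    return out


def patch_scale(path, row, new_scale):
    """Overwrite one row's scale with a plain file write (stands in for the other process)."""
    with open(path, "r+b") as f:
        f.seek(row * ROW_SIZE)
        f.write(struct.pack("<f", new_scale))


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    write_toy_weights(path)
    hidden = [1, 2, 3, 4, 5, 6, 7, 8]

    # The "server": maps the file read-only once and keeps the mapping open.
    with open(path, "rb") as f:
        view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    before = logits(view, hidden)

    # The "tamperer": writes to the same file; the mapping is never reopened.
    patch_scale(path, row=1, new_scale=50.0)
    after = logits(view, hidden)

    view.close()
    os.remove(path)
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print("before:", before, "argmax:", before.index(max(before)))
    print("after: ", after, "argmax:", after.index(max(after)))
```

The point of the sketch is that the reader never remaps or reloads anything: the second `logits` call reads through the same mapping and already sees the changed scale, which is why the effect persists across requests without a restart.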