Running SmolLM2‑360M on a Samsung Galaxy Watch 4 (380MB RAM) – 74% RAM reduction in llama.cpp
r/LocalLLaMA
•
Generative AI
Open Source AI
I’ve got SmolLM2‑360M running on a Samsung Galaxy Watch 4 Classic (about 380MB free RAM) by tweaking llama.cpp and the underlying ggml memory model. By default, the model was being loaded twice in RAM: once via the APK’s mmap page cache and again via ggml’s tensor allocations, peaking at 524MB for a 270MB model. The fix: I pass host_ptr into llama_model_params, so CPU tensors point directly into the mmap region and only Vulkan tensors are copied.