Show HN: Efficient LLM Architectures for 32GB RAM (Ternary and Sparse Inference)
Hi HN, I’ve been exploring how far large language models can be pushed on machines with limited memory. I built an experimental runtime and architecture aimed at making very large models feasible on systems with around 32GB of RAM. The core idea is to combine four efficiency techniques: ternary weight representation {-1, 0, +1} (log2(3) ≈ 1.58 bits per weight), sparse execution that skips zero weights, memory-mapped layer streaming from NVMe storage, and lightweight tensor unpacking optimized for Apple Silicon.
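To make the first two techniques concrete, here is a minimal pure-Python sketch of ternary packing and a zero-skipping dot product. It is an illustration under stated assumptions, not the runtime's actual format: the function names (pack_ternary, unpack_ternary, sparse_ternary_dot) are hypothetical, and the base-3 layout of five weights per byte (3^5 = 243 fits in a byte, so 1.6 bits per weight, close to the ~1.58-bit bound) is one common choice that I am assuming here.

```python
# Hypothetical helpers for illustration; not the post's actual on-disk format.
TRITS_PER_BYTE = 5  # 3**5 = 243 <= 256, so five ternary weights fit in one byte


def pack_ternary(weights):
    """Pack weights in {-1, 0, +1} into bytes, five per byte, as base-3 digits."""
    out = bytearray()
    for i in range(0, len(weights), TRITS_PER_BYTE):
        chunk = weights[i:i + TRITS_PER_BYTE]
        value = 0
        # The first weight of the chunk becomes the lowest base-3 digit.
        for w in reversed(chunk):
            if w not in (-1, 0, 1):
                raise ValueError(f"not a ternary weight: {w}")
            value = value * 3 + (w + 1)  # map -1, 0, +1 to digits 0, 1, 2
        out.append(value)
    return bytes(out)


def unpack_ternary(packed, n):
    """Recover the first n ternary weights from bytes made by pack_ternary."""
    weights = []
    for byte in packed:
        for _ in range(TRITS_PER_BYTE):
            if len(weights) == n:
                return weights
            weights.append(byte % 3 - 1)  # lowest digit first, back to -1, 0, +1
            byte //= 3
    return weights


def sparse_ternary_dot(weights, x):
    """Dot product without multiplications: add, subtract, or skip each input."""
    acc = 0.0
    for w, xi in zip(weights, x):
        if w == 0:
            continue  # a zero weight contributes nothing, so skip it
        acc += xi if w > 0 else -xi
    return acc


if __name__ == "__main__":
    w = [1, 0, -1, 0, 1, -1, 0]
    packed = pack_ternary(w)
    restored = unpack_ternary(packed, len(w))
    print(len(packed), restored == w)
    print(sparse_ternary_dot(restored, [0.5, 2.0, 1.0, 3.0, 4.0, 1.5, 9.0]))
```

The per-element branch is only meant to show the idea. An optimized kernel would more plausibly skip zeros at block granularity and unpack with vector instructions, but those details are not described in the post.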