Show HN: Efficient LLM Architectures for 32GB RAM (Ternary and Sparse Inference)

Hi HN, I’ve been exploring how far large language models can be pushed on machines with limited memory. I built an experimental runtime and architecture aimed at making very large models feasible on systems with around 32GB of RAM. The core idea is to combine several efficiency techniques: ternary weights in {-1, 0, +1} (log2(3) ≈ 1.58 bits per weight), sparse execution that skips zero weights, memory-mapped layer streaming from NVMe storage, and lightweight tensor unpacking optimized for Apple Silicon.
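
To make the first two techniques concrete, here is a minimal pure-Python sketch of ternary packing and a zero-skipping matrix-vector product. It is an illustration under stated assumptions, not the runtime's actual format: the base-3 layout (five weights per byte, so 1.6 bits per weight, slightly above the log2(3) ≈ 1.585 lower bound) and the names `pack_ternary`, `unpack_ternary`, and `sparse_ternary_matvec` are hypothetical.

```python
TRITS_PER_BYTE = 5  # 3**5 = 243 <= 256, so five ternary weights fit in one byte


def pack_ternary(weights):
    """Pack weights from {-1, 0, +1} into bytes as base-3 digits (assumed layout)."""
    out = bytearray()
    for i in range(0, len(weights), TRITS_PER_BYTE):
        chunk = weights[i:i + TRITS_PER_BYTE]
        value = 0
        # Walk the chunk backwards so the first weight ends up as the lowest digit.
        for w in reversed(chunk):
            if w not in (-1, 0, 1):
                raise ValueError(f"not a ternary weight: {w!r}")
            value = value * 3 + (w + 1)  # map -1, 0, +1 to digits 0, 1, 2
        out.append(value)
    return bytes(out)


def unpack_ternary(packed, count):
    """Recover the first `count` ternary weights from a packed byte string."""
    weights = []
    for byte in packed:
        for _ in range(TRITS_PER_BYTE):
            weights.append(byte % 3 - 1)  # lowest digit first, back to -1/0/+1
            byte //= 3
    return weights[:count]  # drop padding digits from a short final chunk


def sparse_ternary_matvec(rows, x):
    """Compute W @ x for ternary W using only adds and subtracts, skipping zeros."""
    y = []
    for row in rows:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 0:
                continue  # zero weights contribute nothing, so no work is done
            acc += xi if w > 0 else -xi
        y.append(acc)
    return y


if __name__ == "__main__":
    weights = [1, 0, -1, 0, 0, 1, -1]
    packed = pack_ternary(weights)
    print(len(packed), "bytes for", len(weights), "weights")
    print(unpack_ternary(packed, len(weights)) == weights)
    print(sparse_ternary_matvec([[1, 0, -1], [0, 0, 1]], [2.0, 3.0, 5.0]))
```

A real runtime would do the unpacking with vectorized lookups rather than per-digit division, and would read the packed bytes from a memory-mapped file instead of holding them in a Python `bytes` object. The sketch only shows why ternary weights remove multiplications and why zeros can be skipped outright.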