What small models are you using for background/summarization tasks?
r/LocalLLaMA
I'm experimenting with a smaller, faster model for summarization and other background tasks. The main model (GLM-4.7-flash or Qwen3.5:35b-a3b) stays on the GPU for chat and tool use, while a smaller model (Qwen3.5:4b) runs on the CPU for the grunt work. Honestly, I've been enjoying the results. These new Qwen models really stepped up: I can reliably offload summarization and memory extraction to the small one and still get good output. I'm thinking of trying the smaller models for subagent/A2A stuff too, like running parallel tasks to read files, do research, etc.
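For anyone who wants to try the same split, here is a rough sketch of how the routing could look. It assumes an Ollama-style server on its default local port, which the post doesn't actually specify, so treat the URL and the `num_gpu` option as assumptions about that setup. The task names, `build_request`, `send`, and `run_background` are made up for illustration; the model tags are copied from the post and may need adjusting to whatever your server lists.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: an Ollama-style server listening on its default local port.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Model tags copied from the post; the task names are hypothetical.
MAIN_MODEL = "Qwen3.5:35b-a3b"
SMALL_MODEL = "Qwen3.5:4b"
BACKGROUND_TASKS = {"summarize", "extract_memory"}


def build_request(task, prompt):
    """Pick a model for the task and build the JSON payload."""
    background = task in BACKGROUND_TASKS
    payload = {
        "model": SMALL_MODEL if background else MAIN_MODEL,
        "prompt": prompt,
        "stream": False,
    }
    if background:
        # Assumption: num_gpu = 0 asks the server to offload no layers
        # to the GPU, so the small model runs on the CPU.
        payload["options"] = {"num_gpu": 0}
    return payload


def send(payload, url=OLLAMA_URL):
    """POST the payload and return the generated text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


def run_background(prompts, task="summarize", sender=send, workers=4):
    """Fan several background prompts out to the small model in parallel."""
    payloads = [build_request(task, p) for p in prompts]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(sender, payloads))
```

Something like `run_background(["Summarize: ...", "Summarize: ..."])` would then send each prompt to the small model while the main model stays free for chat. Whether those requests really run at the same time depends on how the server is configured, so the thread pool only guarantees that they are submitted in parallel.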