[Architecture Help] Serving Embed + Rerank + Zero-Shot Classifier on 8GB VRAM. Fighting System RAM Kills and Latency.

r/LocalLLaMA

Hey everyone, I’ve been banging my head against the wall on this for a few weeks and could really use some architecture or MLOps advice. I am building a unified Knowledge Graph / RAG service for a local coding agent. It runs as a FastAPI app in a single Docker container. It ran okay on Windows (WSL) at first, but moving it to native Linux has exposed severe memory-limit issues under stress tests.