MLX 16/8/4/2-bit quants of nvidia/llama-embed-nemotron-8b
r/LocalLLaMA
I converted nvidia/llama-embed-nemotron-8b to MLX fp16, 8-bit, 4-bit, and 2-bit (for my OCD) and put them on Hugging Face:

ncorder/llama-embed-nemotron-8b-mlx-fp16
ncorder/llama-embed-nemotron-8b-mlx-8bit
ncorder/llama-embed-nemotron-8b-mlx-4bit
ncorder/llama-embed-nemotron-8b-mlx-2bit

I was running this model using GGUFs + llama-server for local semantic search over an Obsidian vault and some other projects. It worked fine, but I got tired of managing a whole HTTP server just for embeddings, and I also wanted Apple Silicon optimizations.
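For anyone wondering what the semantic search side looks like once you have embeddings in hand, here is a rough sketch of the ranking step only. It is not my actual pipeline: the note names and the 3-dimensional vectors are made-up toy values, and `cosine_similarity` and `top_k` are hypothetical helpers written for this example. In real use the vectors would come from the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, note_vecs, k=3):
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in note_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, score) for score, name in scored[:k]]


# Toy stand-ins for note embeddings (assumption: real ones are much higher-dimensional).
notes = {
    "mlx-notes.md": [0.9, 0.1, 0.0],
    "recipes.md": [0.0, 0.2, 0.9],
    "llama-server.md": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]  # toy stand-in for the embedded search query

results = top_k(query, notes, k=2)
for name, score in results:
    print(f"{score:.3f}  {name}")
```

With a vault of a few thousand notes, a brute-force scan like this is usually fast enough that no vector database is needed.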