Running LLMs Locally: A Rigorous Benchmark of Phi-3, Mistral, and Llama 3.2 on Ollama

Abstract

This report presents a comprehensive evaluation of three small language models (SLMs), Llama 3.2 (3B), Phi-3 mini, and Mistral 7B, running locally via Ollama. A FastAPI-based benchmarking framework was developed to measure inference speed, resource consumption, and each model's ability to produce valid JSON output that conforms to Pydantic schemas. When a response is malformed, a retry mechanism reprompts the model. The models were tested on a suite of 30 prompts spanning general knowledge, mathematics, coding, reasoning, and creative writing.
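To make the retry-with-reprompting idea concrete, the following is a minimal sketch, not the framework's actual code. The names `generate_validated`, `call_model`, `validate_answer`, and the `max_retries` default are hypothetical, and the model is replaced by a stub so the example runs without Ollama. To stay dependency-free, a hand-written check stands in for the Pydantic schema.

```python
import json
from typing import Callable, TypeVar

T = TypeVar("T")


def generate_validated(
    prompt: str,
    call_model: Callable[[str], str],  # hypothetical: sends a prompt, returns raw text
    validate: Callable[[str], T],      # must raise ValueError on malformed output
    max_retries: int = 2,              # assumed retry budget, not taken from the report
) -> tuple[T, int]:
    """Return (validated_output, attempts_used); raise ValueError if every attempt fails."""
    current_prompt = prompt
    last_error = None
    for attempt in range(1, max_retries + 2):  # one initial try plus max_retries retries
        raw = call_model(current_prompt)
        try:
            return validate(raw), attempt
        except ValueError as exc:
            last_error = exc
            # Reprompt: restate the task and tell the model what was wrong.
            current_prompt = (
                f"{prompt}\n\nYour previous reply was invalid ({exc}). "
                "Respond with valid JSON only."
            )
    raise ValueError(f"no valid output after {max_retries + 1} attempts: {last_error}")


def validate_answer(raw: str) -> dict:
    """Stand-in schema check: a JSON object with a string 'answer' field."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        raise ValueError("expected an object with a string 'answer' field")
    return data


# Stub model: replies once with non-JSON text, then with valid JSON.
replies = iter(["Sure! The answer is Paris.", '{"answer": "Paris"}'])
result, attempts = generate_validated(
    "What is the capital of France?", lambda _p: next(replies), validate_answer
)
print(result, attempts)
```

Because Pydantic's `ValidationError` subclasses `ValueError`, a model's validation method (for example `model_validate_json` in Pydantic v2) could be passed as `validate` without changing the loop. Returning the attempt count alongside the result lets a benchmark record how often each model needed a retry.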