AI RESEARCH

KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving

arXiv CS.AI

arXiv:2604.16682v1 Announce Type: cross

Power has become a central bottleneck for AI inference. This problem is becoming urgent as agentic AI emerges as a major workload class, yet prior power-management techniques focus almost entirely on single-turn LLM serving. Our analysis shows that agentic serving behaves fundamentally differently: each request carries long-lived context that evolves across tool-interleaved turns, and lowering GPU frequency can push the system into a thrashing regime where memory pressure sharply worsens both performance and power efficiency.