A Hierarchical End-of-Turn Model with Primary Speaker Segmentation for Real-Time Conversational AI

arXiv:2603.13379v1 Announce Type: new We present a real-time front-end for voice-based conversational AI that enables natural turn-taking in two-speaker scenarios by combining primary speaker segmentation with hierarchical End-of-Turn (EOT) detection. To operate robustly in multi-speaker environments, the system continuously identifies and tracks the primary user, so that downstream EOT decisions are not confounded by background conversations.
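
To make the gating idea concrete, the following is a minimal sketch of how primary-speaker labels could keep background speech from delaying an end-of-turn decision. It is not the system described above. The `Frame` type, the `detect_end_of_turn` function, and the `min_silence_frames` threshold are all hypothetical names introduced for illustration, and a simple silence timeout stands in for the hierarchical EOT model, whose internals the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Frame:
    """One audio frame with hypothetical upstream labels."""
    speech: bool   # any speech detected in this frame
    primary: bool  # speech attributed to the primary speaker


def detect_end_of_turn(frames: Iterable[Frame],
                       min_silence_frames: int = 3) -> List[int]:
    """Return the frame indices at which an end of turn is declared.

    Assumption: a turn ends once the primary speaker has been inactive
    for `min_silence_frames` consecutive frames. Background speech is
    treated the same as silence, so it cannot keep a turn open.
    """
    eot_indices: List[int] = []
    silence_run = 0
    in_turn = False
    for i, frame in enumerate(frames):
        if frame.speech and frame.primary:
            # The primary speaker is talking: open or extend the turn.
            in_turn = True
            silence_run = 0
        elif in_turn:
            # Primary speaker inactive (true silence or background speech).
            silence_run += 1
            if silence_run >= min_silence_frames:
                eot_indices.append(i)
                in_turn = False
                silence_run = 0
    return eot_indices


# Illustrative input: the primary user speaks, then only a background
# speaker is active, then the primary user speaks again before silence.
example = (
    [Frame(True, True)] * 2      # frames 0-1: primary speech
    + [Frame(True, False)] * 4   # frames 2-5: background speech only
    + [Frame(True, True)]        # frame 6: primary speech
    + [Frame(False, False)] * 3  # frames 7-9: silence
)
print(detect_end_of_turn(example))  # [4, 9]
```

In this toy input the first end of turn is declared at frame 4 even though a background speaker is still talking, which is the behavior the primary speaker tracking is meant to provide. A detector driven by raw voice activity alone would treat frames 2 through 5 as a continuation of the turn.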