AI voice generation has a workflow problem, not just a quality problem

Most discussion around AI voice tools focuses on model quality. How natural is the voice? How good is cloning? Can it handle emotion? Can it speak multiple languages? Those things matter, but I think the bigger unsolved problem is workflow. Generating one short voice clip is easy now. The hard part starts when someone wants to make something longer: a podcast draft, audiobook chapter