Qwen-Image-Layered vs. multimodal agentic solutions

I'm looking for a good API solution for decomposing UI images and screenshots into layers. I've seen Qwen-Image-Layered and multimodal agentic products like this. I have a feeling neither is good enough, but at least the latter is robust across many use cases; I think it uses stronger models. Any ideas? Any other solutions? I really need this for a large project. submitted by /u/Upstairs-Breakfast49