Is there a local model that can follow instructions and take an image input?

With Gemini (commercial), I can feed it an image and prompt it to rotate the camera 90 degrees around the subject, and it'll generate a plausible image in which it had to invent a new perspective of the subject and background. Gemini does this about as well as can be expected, but it has limitations like