This is what VLA (vision-language-action) models are for. They would work much better. They'd need a bit of fine-tuning, but probably not much. There's lots of literature out there on using VLAs to control drones.

Did some research and found a model that does exactly that: https://cognitivedrone.github.io/

The Black Mirror speedrun continues

Thanks, will check this out!