Text-based is too slow and takes too much typing. It'll take off only if it's real-time and voice-based.

Yeah sure, we'll add it to the roadmap. But do you think speaking or typing basic instructions is better than actually doing the action through the UI?

I feel like for basic interactions like dragging, etc., it's better if the user does it by hand. AI can handle complicated workflows like removing silences, quickly removing unwanted background elements, etc.