Thanks for sharing these. They are very interesting. I found "making all app features accessible from a textual interface..." actually quite challenging in certain domains, such as graphics-related editing tools. Many editing functions can be exposed properly through a CLI, but the content being edited is very hard to be converted into texts without losing its geometric meaning. Maybe this is where we truly need multimodal models, or where training on specialized data is needed.

> the content being edited is very hard to be converted into texts

For decades now, pro design print shops have required text files describing the design to print from.

And as every Danish pelican cyclist knows, graphics are at their most scalable as text vectors.

Inkscape does fine with these.
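
To make that concrete, here is a minimal sketch, assuming plain SVG and Python's standard library only. The helper names `make_svg` and `read_shapes` are made up for illustration. It writes two shapes out as SVG text and reads the same coordinates back, so the geometry survives the trip through text:

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def make_svg(shapes, width=200, height=100):
    """Serialize (tag, attributes) pairs into an SVG document string.

    Hypothetical helper for illustration, not part of any editor's API.
    """
    ET.register_namespace("", SVG_NS)  # emit <svg xmlns="..."> without a prefix
    root = ET.Element(
        f"{{{SVG_NS}}}svg",
        {
            "width": str(width),
            "height": str(height),
            "viewBox": f"0 0 {width} {height}",
        },
    )
    for tag, attrs in shapes:
        ET.SubElement(root, f"{{{SVG_NS}}}{tag}", {k: str(v) for k, v in attrs.items()})
    return ET.tostring(root, encoding="unicode")


def read_shapes(svg_text):
    """Parse SVG text back into (tag, attributes) pairs; values stay strings."""
    root = ET.fromstring(svg_text)
    return [(el.tag.split("}", 1)[-1], dict(el.attrib)) for el in root]


# The geometry lives entirely in text: a circle and a rectangle.
shapes = [
    ("circle", {"cx": 50, "cy": 50, "r": 30}),
    ("rect", {"x": 100, "y": 20, "width": 80, "height": 60}),
]
svg_text = make_svg(shapes)
recovered = read_shapes(svg_text)
print(svg_text)
print(recovered)
```

Saved to a `.svg` file, the printed text should open in Inkscape or a browser. Whether a text-only model can reason well about those coordinates is a separate question.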