Maybe we need some agent running on the PC to offload some of these tasks. It could scrape the display at 30 or 60 Hz and produce a textual version of what's going on for the model to consume.