I've been interested in real-time human-AI interaction for a while. This project is a prototype closed-loop drawing system, a kind of "visual autocomplete" for drawing. The idea is that the user just draws along with the AI, without interrupting the flow to type text prompts.
It works by having the AI continually observe and respond to the live drawing on a canvas: a vision model (running via Ollama) interprets what it sees, and that description drives real-time image generation (StreamDiffusion).
For real-time performance, the project is built in C++ and Python and uses Spout to share textures between applications directly on the GPU with minimal overhead.
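To make the loop concrete, here's a minimal Python-only sketch of the idea (not the actual implementation, which is split across C++ and the Spout pipeline): a canvas snapshot is sent to an Ollama vision model, and its description is forwarded over OSC as the next prompt. The model name, OSC port, `/prompt` address, and `canvas.png` snapshot are placeholder assumptions.

```python
# Minimal sketch of the closed loop (assumptions: a llava model pulled in Ollama,
# an OSC listener on port 9000 with a /prompt address, and canvas.png standing in
# for the live canvas capture).
import time

import ollama
from pythonosc.udp_client import SimpleUDPClient

osc = SimpleUDPClient("127.0.0.1", 9000)  # hypothetical image-generator OSC endpoint

while True:
    # Ask the vision model what it currently sees on the canvas.
    response = ollama.chat(
        model="llava",
        messages=[{
            "role": "user",
            "content": "Describe this drawing in one short phrase.",
            "images": ["canvas.png"],  # placeholder for the live canvas frame
        }],
    )
    description = response["message"]["content"].strip()

    # Feed the description to the image generator as its new prompt.
    osc.send_message("/prompt", description)
    time.sleep(0.5)  # simple throttle; the real system is frame/event driven
```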
Reusable components include:
- StreamDiffusionSpoutServer: lightweight Python server for real-time image generation with StreamDiffusion. Designed to interface with any Spout-compatible software, with instructions sent over OSC.
- OllamaClient: minimal C++ library for interfacing with Ollama vision language models. Includes implementations for openFrameworks and Cinder.
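For anyone curious what a minimal Ollama client has to do under the hood, this is roughly the standard Ollama REST call for a vision model that such a wrapper builds on: a base64-encoded image posted to `/api/generate`. Shown in Python for brevity; the model name and `canvas.png` snapshot are placeholders, and this isn't OllamaClient's own API.

```python
# Sketch of the raw Ollama vision request a minimal client wraps.
# Assumes Ollama is running locally on its default port with a llava model pulled.
import base64
import json
import urllib.request

with open("canvas.png", "rb") as f:  # placeholder canvas snapshot
    image_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "llava",
    "prompt": "Describe this drawing in one short phrase.",
    "images": [image_b64],
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])  # the model's description text
```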
The "visual autocomplete" concept has been explored in recent papers (e.g., arxiv.org/abs/2508.19254, arxiv.org/abs/2411.17673).
Hopefully, these open source components can help others experiment with and push this direction further!
It would be great if there were a React/TS version instead of C++.
Totally! This project has a slightly different focus, though, with its emphasis on performance and standalone use: local Ollama models, and Spout for sharing textures across applications on the GPU. There's https://daydream.live/, a hosted streaming StreamDiffusion service, which could make a web-based implementation possible.