I like the idea, but you could take it a step further and have just a core virtual machine that you attach virtual input/output devices to. The canvas and audio would then just be virtual devices that meet some specification. For example, if you only want to listen to an audio playlist, you could attach an audio device, a keyboard, and a terminal device (for feedback). A canvas device wouldn't be required if the application has no use for one. It would be up to the user to attach the devices an application requires, or at least the user would have direct control over which ones are attached.
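
To make that a bit more concrete, here is a rough sketch of what "a core VM plus attachable devices" could look like. Everything in it is made up for illustration: `CoreVM`, `Device`, `TerminalDevice`, `AudioDevice`, and the single `write` method are hypothetical names, not part of any existing spec or runtime, and a real design would need a much richer device interface.

```python
from typing import Protocol


class Device(Protocol):
    """Hypothetical device specification: a name and a byte-oriented write channel."""

    name: str

    def write(self, data: bytes) -> None: ...


class TerminalDevice:
    """Text feedback device; stores what it receives as decoded lines."""

    name = "terminal"

    def __init__(self):
        self.lines = []

    def write(self, data: bytes) -> None:
        self.lines.append(data.decode("utf-8"))


class AudioDevice:
    """Stand-in audio device; it only queues the raw chunks it is given."""

    name = "audio"

    def __init__(self):
        self.queued = []

    def write(self, data: bytes) -> None:
        self.queued.append(data)


class CoreVM:
    """Core machine that knows nothing about canvas or audio, only attached devices."""

    def __init__(self):
        self.devices = {}

    def attach(self, device: Device) -> None:
        # The user decides what gets attached; the VM never attaches anything itself.
        self.devices[device.name] = device

    def missing(self, required):
        # Names of devices an application asks for that the user has not attached.
        return sorted(set(required) - set(self.devices))

    def send(self, device_name: str, data: bytes) -> None:
        if device_name not in self.devices:
            raise LookupError(f"device not attached: {device_name}")
        self.devices[device_name].write(data)


# The audio-playlist case: audio plus a terminal, and no canvas.
vm = CoreVM()
vm.attach(AudioDevice())
vm.attach(TerminalDevice())
```

With this shape, an application that asks for a canvas simply shows up in `vm.missing([...])` until the user chooses to attach one, which is the "direct control" part.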

TL;DR: like QEMU, but much simpler, and only WASM needs to be supported.

Yes, but it would also be good to have some dumbed-down version of HTML/DOM/CSS, so that text can be copied and accessibility works.
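
As a rough illustration of how small that could be, here is a sketch of a text tree where each node carries a role and some text. The `TextNode` type, the role names, and both helper functions are hypothetical, invented just for this example. The point is only that if an application hands over structure instead of pixels, both copy and a screen-reader-style outline fall out of the same data.

```python
from dataclasses import dataclass, field


@dataclass
class TextNode:
    """Hypothetical node in a stripped-down document tree."""

    role: str  # e.g. "document", "heading", "paragraph"
    text: str = ""
    children: list = field(default_factory=list)


def plain_text(node):
    """Flatten the tree in document order: roughly what a copy operation would yield."""
    parts = [node.text] if node.text else []
    for child in node.children:
        parts.append(plain_text(child))
    return "\n".join(p for p in parts if p)


def accessibility_outline(node, depth=0):
    """List role/text pairs, indented by depth, as assistive tech might walk them."""
    lines = [f"{'  ' * depth}{node.role}: {node.text}".rstrip()]
    for child in node.children:
        lines.extend(accessibility_outline(child, depth + 1))
    return lines


doc = TextNode("document", children=[
    TextNode("heading", "Playlist"),
    TextNode("paragraph", "Track 1"),
])
```

Styling could then be a separate, optional layer on top of the roles, so the text stays selectable and readable even when a device ignores the styling entirely.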