I would just have multiple cores that communicate over a central shared hub, like the Parallax Propeller MCUs. If you want concurrency, push the new task onto a separate core.
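To make that concrete, here is a rough toy model in Python, not anything resembling real Propeller code. It assumes a hub that hands each core one shared-memory access per rotation, which is roughly how I understand the Propeller's hub to work. The names `Hub`, `spawn`, `run`, `producer`, and `consumer` are all made up for illustration, and each "core" is just a generator that yields memory requests.

```python
class Hub:
    """Toy model: a fixed set of cores sharing one memory via a round-robin hub."""

    def __init__(self, num_cores=8):
        self.memory = {}                 # shared hub memory (hypothetical layout)
        self.cores = [None] * num_cores  # one running task per core, None if idle

    def spawn(self, task, *args):
        """Push a new task onto the first idle core and return that core's index."""
        for core_id, running in enumerate(self.cores):
            if running is None:
                self.cores[core_id] = task(core_id, *args)
                return core_id
        raise RuntimeError("no free core")

    def _access(self, request):
        """Service one memory request; only one core is ever in here at a time."""
        op, addr, *value = request
        if op == "write":
            self.memory[addr] = value[0]
            return None
        if op == "read":
            return self.memory.get(addr)
        raise ValueError(f"unknown hub operation: {op!r}")

    def run(self):
        """Rotate over the cores, giving each one hub access per rotation."""
        replies = [None] * len(self.cores)
        while any(task is not None for task in self.cores):
            for core_id, task in enumerate(self.cores):
                if task is None:
                    continue
                try:
                    # Hand back the previous reply and collect the next request.
                    request = task.send(replies[core_id])
                except StopIteration:
                    self.cores[core_id] = None  # task finished, core is idle again
                    continue
                replies[core_id] = self._access(request)


def producer(core_id, n):
    """Write 0..n-1 into a shared mailbox, waiting until it is empty each time."""
    for i in range(n):
        while (yield ("read", "mailbox")) is not None:
            pass
        yield ("write", "mailbox", i)


def consumer(core_id, n):
    """Take n values out of the mailbox and store their sum in shared memory."""
    total = 0
    for _ in range(n):
        while True:
            value = yield ("read", "mailbox")
            if value is not None:
                break
        yield ("write", "mailbox", None)  # mark the mailbox as empty
        total += value
    yield ("write", "total", total)


hub = Hub(num_cores=8)
hub.spawn(producer, 3)  # lands on core 0
hub.spawn(consumer, 3)  # lands on core 1
hub.run()
print(hub.memory["total"])  # prints 3 (0 + 1 + 2)
```

Since the hub only services one request per core per slot, the tasks need no locks. The catch is that every exchange between cores has to be written out as explicit polling on shared memory.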

Still, the problem is that writing a compiler for such a system would suck.