If you're interested in co-designing a CPU together with its software, a TTA (transport-triggered architecture) is an attractive way to do it, particularly because it's easy to design one that executes more than one MOV at the same time and thus exposes explicit parallelism.

The tough part, though, is that memory is usually slow: you have to wait an undetermined number of cycles for data to come back from DRAM, and while one operation is blocked, all the other operations are blocked too.

I guess you could address this with a fancy memory controller that could be programmed explicitly to start fetching ahead of time, so the data is available when it is needed, at least most of the time.

I played around making a TTA-ish thing as part of learning Verilog some years ago. It's a neat idea: https://github.com/rdaum/simple_tta

Exactly. It's easier than developing a CPU the normal way, and it offers the possibility of making something with unique capabilities, as opposed to the mostly boring option of revisiting the Z-80 [1] or the near certainty of getting bogged down trying to implement a modern high-performance CPU and getting pipelining, superscalar execution, and all that to work.

[1] with the caveat that extending that kind of chip to support a larger address space and simple memory protection is interesting to me

How would you handle context switching? You've got a whole lot of exposed state scattered throughout the CPU.

By not doing it. The ideology here is that, from the standpoint of embedded systems, general-purpose computing took numerous wrong turns from the 1950s to the present.

I thought this through back when I was doing embedded projects with the AVR-8, namely display controllers for persistence-of-vision displays. Something like this doesn't run an OS, so you don't need context switching for the OS's sake.

It was practical to write C code for this, but I didn't really like it: code like this doesn't need the stack or the affordances that C calling conventions provide; the data structures needed to display a scene are dynamic only within the scope of the scene; and you have 32 registers, which is a lot, enough that you can allocate 8 to the interrupt handler and still have plenty left over for the main loop.

I was wargaming my paths forward if I needed more power. The obvious route, which I probably would have taken, is the portable C route via ARM, e.g. an STM32. Yet I liked the AVR-8 a lot, so I also considered the route of going to an FPGA board on which you could instantiate an AVR-8 soft core clocked higher than any real hardware AVR-8 and also put an accelerator behind it.

The FPGA + TTA + co-designed-software route came up at this point. Notably, any kind of concurrency, parallelism, or extra context can be baked into the "hardware". Adding a few registers is much cheaper than adding superscalar features, and adding another MOV slot to the instructions is then pretty cheap if you want more parallelism, with the caveat that it could be hard to prevent blocking. If the requirements change, it's a frickin' FPGA: you can add something to it or take something away.

What would really put the whole idea on wheels is a superoptimizing compiler that could design both the CPU and the code that runs on top of it.

I would just have multiple cores, with communication between them happening over a central shared hub, like in the Parallax Propeller MCUs. If you want concurrency, push your new task onto a separate core.

Still, the problem is that writing a compiler for such a system would suck.