I spent a lot of time on systolic arrays to compute cryptocurrency PoW (BLAKE2 specifically). It’s an interesting problem and I learned a lot, but I made no progress. I’ve often wondered whether anyone else has tried the same thing.

You should check out AMD's NPU architecture.