> possibly have two of them on one board.

That would mean NUMA, and memory bandwidth for any work that has to reach across to the other chip's memory would probably suck. Would that even beat a simple cluster on performance?
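For a rough feel of why, here is a back-of-envelope sketch. It assumes a simple model in which some fraction of memory traffic goes to the other socket at a lower bandwidth; the function name and the GB/s figures are placeholders I made up, not measurements of any real board.

```python
def effective_bandwidth(local_gbs, remote_gbs, remote_fraction):
    """Average bandwidth when a fraction of the bytes come from the other socket."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    # Time per byte is the weighted sum of the per-byte times, so the two
    # rates combine harmonically rather than arithmetically.
    time_per_unit = (1.0 - remote_fraction) / local_gbs + remote_fraction / remote_gbs
    return 1.0 / time_per_unit


# Hypothetical figures for illustration only.
LOCAL_GBS = 100.0
REMOTE_GBS = 25.0

for frac in (0.0, 0.25, 0.5, 1.0):
    bw = effective_bandwidth(LOCAL_GBS, REMOTE_GBS, frac)
    print(f"{frac:.0%} remote -> {bw:.1f} GB/s")
```

Under this model even a modest share of remote accesses drags the average down quickly, since the slow path dominates the total time. Real numbers depend on the interconnect and on how well the workload keeps its data local.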