Hacker News

It depends on what you're comparing. If the same model fits in the combined VRAM but not in a single card's VRAM, then running two instances of it won't be faster, because neither instance fits on one card. If you're comparing a 23 GB model running duplicated vs a 46 GB model running split, then yeah, the duplicated setup will likely be faster, just because there's no synchronization between cards.

AFAIUI, there'd be little advantage in having a higher-speed inter-card connection, because the cards don't really talk to each other during inference. The loss of efficiency compared to a monolithic memory architecture comes from scheduling, not from data transfer.
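
To put rough numbers on that, here's a back-of-the-envelope sketch. Everything in it is an assumption I picked for illustration, not a measurement: the model is split by layers so only one activation vector crosses between cards per generated token, the hidden size is 8192 in fp16, the link moves about 32 GB/s, and each card reads weights at about 1 TB/s. It also ignores link latency and assumes decoding is bound by reading the weights.

```python
# Rough estimate: per-token inter-card traffic vs. per-token weight reads.
# All constants below are illustrative assumptions, not measured values.

HIDDEN_SIZE = 8192        # assumed activation width at the split point
BYTES_PER_VALUE = 2       # assumed fp16 activations
LINK_BYTES_PER_S = 32e9   # assumed inter-card link bandwidth (~32 GB/s)
MODEL_BYTES = 46e9        # the 46 GB split model from the comment above
VRAM_BYTES_PER_S = 1e12   # assumed per-card memory bandwidth (~1 TB/s)


def activation_transfer_seconds(hidden_size, bytes_per_value, link_bytes_per_s):
    """Time to move one token's activation vector across the link."""
    return hidden_size * bytes_per_value / link_bytes_per_s


def weight_read_seconds(model_bytes, vram_bytes_per_s):
    """Time to stream all weights once, card after card, for one token."""
    return model_bytes / vram_bytes_per_s


transfer_s = activation_transfer_seconds(HIDDEN_SIZE, BYTES_PER_VALUE, LINK_BYTES_PER_S)
weights_s = weight_read_seconds(MODEL_BYTES, VRAM_BYTES_PER_S)
ratio = transfer_s / weights_s

print(f"activation transfer per token: {transfer_s * 1e6:.2f} microseconds")
print(f"weight reads per token:        {weights_s * 1e3:.1f} milliseconds")
print(f"transfer / weight-read ratio:  {ratio:.2e}")
```

Under those assumptions the transfer is a tiny fraction of the per-token time, so a faster link wouldn't move the needle much. Setups that split individual layers across cards exchange far more data per token, so this estimate doesn't carry over to them.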