Hacker News

reilly3000 2 days ago

The node count has to be a power of two (2^n), and it's capped at one node per attention head in the model.
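
To make that constraint concrete, here's a rough sketch that lists the node counts it would allow for a given head count. The function name `valid_node_counts` and the 32-head example are made up for illustration, and I'm assuming "one per attention head" means the node count can't exceed the number of heads; nothing here comes from any particular framework.

```python
def valid_node_counts(num_heads: int) -> list[int]:
    """Return the power-of-two node counts that do not exceed num_heads.

    Hypothetical helper: assumes the rule is "node count is 2^n and
    at most one node per attention head", with 2^0 = 1 allowed.
    """
    counts = []
    n = 1  # 2^0
    while n <= num_heads:
        counts.append(n)
        n *= 2  # next power of two
    return counts


# Example with an assumed 32-head model
print(valid_node_counts(32))  # [1, 2, 4, 8, 16, 32]
```

So a model with 12 heads would top out at 8 nodes under this reading, since 16 is over the head count.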