There's only so much the image encoder can process at once. It's pretty apparent when you give the models an image of a big table.
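
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a ViT-style encoder with a fixed 224x224 input and 16-pixel patches. Those values, and the helper names `patch_budget` and `pixels_per_cell`, are illustrative assumptions, not taken from any particular model. Real encoders vary, and some tile high-resolution images instead of resizing them.

```python
def patch_budget(image_size=224, patch_size=16):
    """Number of patch tokens for a square input (assumed ViT-style encoder)."""
    per_side = image_size // patch_size
    return per_side * per_side


def pixels_per_cell(image_size, rows, cols):
    """Approximate pixel area each table cell gets once the image is resized."""
    return (image_size / rows) * (image_size / cols)


tokens = patch_budget()                      # fixed, no matter how dense the image is
small_table = pixels_per_cell(224, 5, 4)     # a few large cells
big_table = pixels_per_cell(224, 50, 20)     # many tiny cells

print(f"patch tokens: {tokens}")
print(f"pixels per cell, 5x4 table: {small_table:.0f}")
print(f"pixels per cell, 50x20 table: {big_table:.0f}")
```

Under these assumptions the token count stays the same while a 50x20 table leaves each cell with far less area than a single 16x16 patch, so several cells end up sharing one token.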

But isn't this basically what the conv layer does...?