Are the results of the matmuls really so far apart in magnitude that you lose significant bits when adding them up in FP32?
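For context, here is a rough sketch of what "far apart" would have to mean. It assumes NumPy, and the helper name `fp32_add_error` is made up for illustration; it says nothing about the actual matmul outputs in question, only about how FP32 addition behaves. FP32 has a 24-bit significand, so an addend only starts getting dropped once it is around 2**24 times smaller than the other term.

```python
import numpy as np

def fp32_add_error(big, small):
    """Absolute error of adding two values in FP32, measured against FP64."""
    exact = np.float64(big) + np.float64(small)
    rounded = np.float32(big) + np.float32(small)
    return abs(np.float64(rounded) - exact)

# Terms 2**24 apart: the smaller one falls below the last bit of the larger.
lost = fp32_add_error(2.0**24, 1.0)

# Terms 2**10 apart: the sum is still exactly representable in FP32.
kept = fp32_add_error(2.0**10, 1.0)

print(lost, kept)
```

So unless the partial results differ by many orders of magnitude, a single FP32 add should lose very little; error that builds up over a long chain of additions is a separate question.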