Are the matmul results really so far apart in magnitude that you lose significant bits when summing them in FP32?
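For reference, here is a rough sketch of how far apart two addends have to be before FP32 addition drops the smaller one entirely. FP32 carries a 24-bit significand, so an addend smaller than about 2^-24 of the other is absorbed completely, and one with a smaller exponent gap loses only some low bits. The helpers `f32` and `add_f32` are hypothetical names made up for this illustration; they emulate FP32 rounding with the standard `struct` module and are not taken from any matmul library.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def add_f32(a: float, b: float) -> float:
    """Add two values as FP32: round the inputs, add, round the sum back to FP32."""
    return f32(f32(a) + f32(b))

# Assumed illustrative values, not measurements from a real matmul.
big = f32(2.0 ** 24)   # spacing between adjacent FP32 values here is 2.0
small = f32(1.0)

# 2**24 + 1 is not representable in FP32, so the small addend is absorbed.
absorbed = add_f32(big, small)

# With a much smaller gap in magnitude the sum is exact.
close = add_f32(f32(1024.0), small)

print(absorbed == big)   # True: the 1.0 vanished
print(close)             # 1025.0
```

So total loss of an addend needs a gap of roughly 2^24 (about 1.7e7) between the partial sums; smaller gaps cost only the lowest few bits of the smaller value.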