Was there a specific reason for this choice?

1x1 convolution is the most lightweight operator for transforming features into outputs.

3x3 convolution is the most common operator used to provide basic computational power.