I'm no expert on cameras, but this seems to explain it:

https://en.wikipedia.org/wiki/Four-tube_television_camera

The article says the main reason for a four-tube design is image registration. Basically, the incoming light is optically split into red, green, and blue, and each color goes to a separate sensor. Because the sensors are separate, their images may not be perfectly aligned with each other. So if you combine R, G, and B (as a weighted sum) to get luminance, the misalignment smears fine detail and the picture will not be sharp.

You can solve this by adding a fourth (black and white) tube for luminance. Since luminance comes from a single tube, there is no alignment issue for the luminance part of the picture. And the eye is less sensitive to fine detail in color than in brightness, so while the color alignment issues remain, they aren't very noticeable.
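To make the registration point concrete, here is a rough sketch of one scan line crossing a sharp black-to-white edge. The luma weights are the standard NTSC / Rec. 601 ones; everything else (the `shift`, `luma`, and `edge_width` helpers, and the 1- and 2-pixel offsets) is made up for illustration and not taken from any real camera.

```python
# Standard NTSC / Rec. 601 luma weights.
WR, WG, WB = 0.299, 0.587, 0.114

def shift(line, n):
    """Shift a scan line right by n pixels, repeating the first pixel."""
    if n == 0:
        return list(line)
    return [line[0]] * n + line[:-n]

def luma(r, g, b):
    """Weighted sum of R, G, B per pixel, as a three-tube camera would do."""
    return [WR * rv + WG * gv + WB * bv for rv, gv, bv in zip(r, g, b)]

def edge_width(line):
    """Count pixels that are neither (nearly) black nor (nearly) white."""
    return sum(1 for v in line if 0.001 < v < 0.999)

# One scan line: 8 black pixels followed by 8 white pixels.
scene = [0.0] * 8 + [1.0] * 8

# Three-tube camera: assume (hypothetically) that the red tube is off by
# 1 pixel and the blue tube by 2 pixels relative to green.
r = shift(scene, 1)
g = shift(scene, 0)
b = shift(scene, 2)
y_three_tube = luma(r, g, b)

# Four-tube camera: luminance comes from a single tube, so there is
# nothing for it to be misaligned with.
y_four_tube = list(scene)

print("three-tube edge width:", edge_width(y_three_tube))
print("four-tube edge width: ", edge_width(y_four_tube))
```

In the three-tube case the edge in the luminance signal is spread over intermediate gray pixels, while the fourth tube's luminance keeps the original hard edge. The misaligned R, G, and B would still produce color fringes at the edge, but that is the part the eye is more forgiving about.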

When I first read this, I assumed the camera would have to somehow combine the fourth tube's signal with the other three tubes' signals. But since both NTSC and PAL encode luminance and chrominance separately, apparently this isn't necessary. It's the TV that combines them. With a three-tube camera, both luminance and chrominance have to be derived from the same three sensor signals. With a four-tube camera, you just take luminance from one sensor and chrominance from the other three sensors.
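As a sketch of the difference, here is how the two kinds of camera might produce the luminance and color-difference signals for a single pixel. This is my own simplification: the function names are hypothetical, and I'm assuming the four-tube camera forms its color differences against an RGB-derived luminance, which real designs may not have done exactly this way.

```python
# Standard NTSC / Rec. 601 luma weights.
WR, WG, WB = 0.299, 0.587, 0.114

def rgb_luma(r, g, b):
    """Luminance computed as a weighted sum of the three color signals."""
    return WR * r + WG * g + WB * b

def three_tube(r, g, b):
    """Luminance and both color differences all come from the RGB tubes."""
    y = rgb_luma(r, g, b)
    return y, b - y, r - y

def four_tube(y_tube, r, g, b):
    """Luminance comes straight from the fourth tube; only the color
    differences are derived from the RGB tubes (assumption: formed against
    the RGB-derived luminance)."""
    y_rgb = rgb_luma(r, g, b)
    return y_tube, b - y_rgb, r - y_rgb

# A mid-gray pixel: both color differences should be (nearly) zero.
print(three_tube(0.5, 0.5, 0.5))

# An orange-ish pixel, with the fourth tube seeing the same brightness.
r, g, b = 0.9, 0.5, 0.1
print(three_tube(r, g, b))
print(four_tube(rgb_luma(r, g, b), r, g, b))
```

When everything is perfectly aligned, the two give the same result. The difference only shows up when the RGB tubes are misregistered: then the three-tube luminance is degraded, while the four-tube luminance is not, and only the lower-detail color signals carry the error.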