Yes, this is another important difference between human auditory perception and classical signal-processing algorithms. Typically, when processing audio, we take a Fourier transform and then throw away the phase information. Most of the time the amplitude information is all you need to understand a sound, but the ear is actually capable of picking up phase information.
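
To make that concrete, here's a rough Python/numpy sketch of what "throw away the phase" usually means in practice (the signal and sample rate are just placeholders I made up for illustration):

    import numpy as np

    # Rough sketch of the standard pipeline: take an FFT, keep the magnitudes,
    # discard the phases. Signal and sample rate here are stand-ins.
    fs = 16000
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 440 * t)      # stand-in for a real recording

    spectrum = np.fft.rfft(x)
    magnitude = np.abs(spectrum)         # what most classical algorithms keep
    phase = np.angle(spectrum)           # what usually gets thrown away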

(I thought this was discussed at some point in Lyon's book but it's admittedly been many years since I read it, so I can't remember for sure.)

What does that mean, though? If you invert the sign of a waveform it sounds the same, so if the ear isn't picking that up, what phase, and relative to what, does it pick up?

This doesn't directly answer your questions, but I can share one example where our sensitivity to phase information becomes apparent: the sound of a pair of hands clapping. It's a great test case for hearing when a set of speakers has phase-alignment problems, a metric that's underappreciated in speaker systems. (It's rare even for experts to put a lot of effort into uniform phase response when designing speakers; it's one of the hardest things to manage, and frequency response, harmonic distortion, dispersion, and even aesthetics usually take priority.)

Likewise, if you mess with the phase alignment of the different frequencies in a hand-clap sample and play it through an otherwise phase-coherent source like earbuds or headphones, the misalignment is really obvious.
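
If you want to try the effect yourself, here's one rough way to do it in numpy: keep each frequency bin's magnitude but replace its phase with a random one, then resynthesize. (The function name and the choice of uniformly random phases are my own illustration, not any particular library's API; a clap or other sharp transient makes the difference most audible.)

    import numpy as np

    def scramble_phase(x, rng=None):
        # Keep each bin's magnitude, replace its phase with a random one.
        # 'x' is any mono float array, e.g. a hand-clap sample.
        rng = np.random.default_rng() if rng is None else rng
        spectrum = np.fft.rfft(x)
        random_phase = np.exp(1j * rng.uniform(0.0, 2 * np.pi, spectrum.shape))
        random_phase[0] = 1.0          # keep the DC bin real
        if len(x) % 2 == 0:
            random_phase[-1] = 1.0     # keep the Nyquist bin real
        return np.fft.irfft(np.abs(spectrum) * random_phase, n=len(x))

    # e.g. with a clap loaded as a float array 'clap':
    # scrambled = scramble_phase(clap)  # same magnitude spectrum, very different sound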

The relative phase between the different frequencies. You're correct that the ear can't pick up a global phase change.

As an extreme example, consider a delta function: silence, then a brief spike, then silence again. If you only look at the amplitudes of its frequency components, this signal is indistinguishable from white noise. The only thing that makes it look (and sound) different from white noise is the relative phase between the frequency components. The ear's ability to detect these phase synchronicities helps it pick out "peakiness" in waveforms more easily. (This is, in turn, important for understanding consonants in speech, which matters enormously for intelligibility, particularly in noisy environments.)
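
You can check this numerically: a shifted impulse and a burst of white noise have essentially the same flat magnitude spectrum, and only the phase relationships differ. A small numpy sketch (the signal length and seed are arbitrary):

    import numpy as np

    n = 1024
    rng = np.random.default_rng(0)

    impulse = np.zeros(n)
    impulse[n // 2] = 1.0                       # "silence, spike, silence"
    noise = rng.standard_normal(n)              # white noise

    # Both magnitude spectra are flat; only the phases differ.
    mag_impulse = np.abs(np.fft.rfft(impulse))  # exactly flat: every bin equal
    mag_noise = np.abs(np.fft.rfft(noise))      # flat on average, random per bin

    # The impulse's phases are locked together (a linear ramp from the delay);
    # the noise's phases are uniformly random. That alignment is the whole difference.
    phase_impulse = np.angle(np.fft.rfft(impulse))
    phase_noise = np.angle(np.fft.rfft(noise))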