The sound in the video seems more sophisticated than plain TTS. It sounds more like the result of analyzing a clip of digital audio and turning it into a series of TTS phonemes.

Assuming SAM is a faithful port of the original, it converts text into phonemes according to a bunch of pronunciation rules.
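
To give a rough idea of what rule-based text-to-phoneme conversion looks like, here is a minimal sketch. The rule table and the `text_to_phonemes` function are made up for illustration and are not SAM's actual rules; a real rule set is much larger and typically also looks at the letters around each match.

```python
# Hypothetical rule table: (letter pattern, phoneme code).
# These entries are illustrative only, NOT SAM's real rules.
# Multi-letter patterns are listed first so they win over single letters.
RULES = [
    ("TH", "TH"),
    ("SH", "SH"),
    ("CH", "CH"),
    ("EE", "IY"),
    ("OO", "UW"),
    ("A", "AE"),
    ("E", "EH"),
    ("I", "IH"),
    ("O", "AA"),
    ("U", "AH"),
]


def text_to_phonemes(text):
    """Scan left to right, applying the first rule that matches at each position."""
    text = text.upper()
    phonemes = []
    i = 0
    while i < len(text):
        for pattern, phoneme in RULES:
            if text.startswith(pattern, i):
                phonemes.append(phoneme)
                i += len(pattern)
                break
        else:
            # No rule matched: pass letters through unchanged, skip everything else.
            if text[i].isalpha():
                phonemes.append(text[i])
            i += 1
    return phonemes


print(text_to_phonemes("sheep"))  # ['SH', 'IY', 'P']
```

The point is only that the output is driven entirely by the input text and a fixed table, which is why it could not by itself reproduce the intonation of a recorded audio clip.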