You should be able to do it all on-device. Check out SAM, the Software Automatic Mouth. The actual data is in the *_tabs files:

https://github.com/ctoth/SAM/tree/master/src

The sound in the video seems more sophisticated than plain TTS. It sounds more like the result of analyzing a clip of digital audio and turning it into a series of TTS phonemes.

Assuming SAM is a faithful port of the original, it converts text into phonemes according to a bunch of pronunciation rules.
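
To give a rough idea of what rule-based text-to-phoneme conversion looks like, here is a toy sketch in Python. The rule table, the phoneme labels, and the `text_to_phonemes` name are all made up for illustration and are not taken from SAM's source; the real rules are far larger and context-sensitive.

```python
# Toy letter-to-phoneme rules: (spelling pattern, phonemes).
# Illustrative only, NOT SAM's actual rule tables or phoneme codes.
RULES = [
    ("tion", ["SH", "AH", "N"]),
    ("sh", ["SH"]),
    ("ch", ["CH"]),
    ("th", ["TH"]),
    ("ee", ["IY"]),
    ("oo", ["UW"]),
    ("a", ["AE"]),
    ("e", ["EH"]),
    ("i", ["IH"]),
    ("o", ["AA"]),
    ("n", ["N"]),
    ("p", ["P"]),
    ("s", ["S"]),
    ("t", ["T"]),
]

# Try longer patterns first so "sh" wins over "s" followed by "h".
SORTED_RULES = sorted(RULES, key=lambda rule: len(rule[0]), reverse=True)


def text_to_phonemes(text):
    """Greedy left-to-right match of spelling patterns to phonemes."""
    text = text.lower()
    phonemes = []
    i = 0
    while i < len(text):
        for pattern, sounds in SORTED_RULES:
            if text.startswith(pattern, i):
                phonemes.extend(sounds)
                i += len(pattern)
                break
        else:
            # No rule covers this character (space, punctuation,
            # or a letter missing from the toy table): skip it.
            i += 1
    return phonemes


print(text_to_phonemes("station"))
```

With this toy table, `text_to_phonemes("ship")` gives `['SH', 'IH', 'P']`. The phoneme list would then drive the synthesizer, which is where the tables in the *_tabs files come in.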