Mainly because global video data corpus is > 100k larger than global text corpus, so you will need to train much larger models for much longer (than current LLMs).