You can do that with smaller models at home. Gemma-4-E4B will run on a 12 GB GPU and supports audio, image, and video input.
12GB GPU is a lot