Hacker News

testbjjl 8 hours ago

For me, having DeepSeek interpret the screenshots and images I send it, at a fraction of what I pay for Claude and ChatGPT, is a far higher priority than dictation support. There are workarounds for dictation but not for image processing.

segmondy 2 hours ago

You can do that with smaller models at home. Gemma-4-E4B will run on a 12 GB GPU and supports audio, image, and video input.

NooneAtAll3 9 minutes ago

A 12 GB GPU is a lot.

anthonypasq 6 hours ago

Just use one of the various cheap Gemini models.

freedomben 3 hours ago

Indeed, Gemini really is incredible at image analysis. Yesterday I pointed it at some sloppy handwritten notes and asked it to add up the numbers in the right column, and it did it no problem. I've also used it to find out what TV show or actor is on screen, and various other things. It's quite impressive.

winstonp 2 hours ago

Gemini pretty clearly has the best underlying model, and the worst RL and post-training of the lot.

carterschonwald 4 hours ago

Gemini models are also fantastic at understanding non-spoken sounds.

corimaith 6 hours ago

Or you could just use a CNN...

bigmadshoe 6 hours ago

CNNs are no longer SoTA when it comes to large models. They also aren't used to produce text interpretations of images, but rather for classification, semantic segmentation, etc.

bonoboTP 2 hours ago

CNNs are fine when trained with a good recipe. There are very few good studies comparing the two with a proper hyperparameter search and all the training tricks applied consistently. Transformers are good, but ViT vs. CNN is not a settled issue. Transformers are more hyped and more popular with tech enthusiasts who just read forums and news, but if you need to get stuff done, CNNs are still great.

bigmadshoe an hour ago

I agree, but since we're talking about image understanding with text output, a CNN is clearly unsuitable. My previous comment was overly reductive, and CNNs can still be SoTA depending on your performance metrics. I spent the earlier part of my career training CNNs, and they are very pleasant to work with.

tehjoker 4 hours ago

Can you say more about that? I haven't kept up.

crypto420 an hour ago

CNNs excel in vision tasks where you have limited compute, limited memory, and limited data, and want something that works really well and runs fast. People usually don't hook CNNs up to a transformer to get language understanding, either; you have to train bespoke CNNs for specific tasks.

ViTs excel where you're unbounded in compute and data and also want text understanding or want to have a conversation about an image.

Jabrov 5 hours ago

Transformers are superior

nullstyle 6 hours ago

Which?