It actually sounds like a fun idea, but I have one question. Do you think a lightweight CNN trained on synthetic candy layouts would outperform the deterministic decoder on messy real-world photos?

Yes, for messy real-world photos a lightweight CNN would probably outperform the deterministic decoder. I’d still use it inside a hybrid pipeline, though, with classic CV handling blob detection and deterministic logic reconstructing the actual program.
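To make the hybrid idea concrete, here is a rough sketch of how the three stages could fit together. It is only an illustration under some loud assumptions: the image is a tiny grid of integer color codes rather than a real photo, `find_blobs` is a plain flood fill standing in for whatever blob detector you'd actually use, and `stand_in_classifier` is a hypothetical majority-vote placeholder sitting where the trained CNN would go. I'm also assuming the CNN's job is to label each detected blob, and that the layout reads left to right, top to bottom. None of the names come from an existing library.

```python
from collections import Counter, deque
from typing import Callable, List, Tuple

Pixel = Tuple[int, int]
Image = List[List[int]]  # 0 = background, nonzero = candy color code (assumed encoding)


def find_blobs(image: Image) -> List[Tuple[float, float, List[Pixel]]]:
    """Classic CV stage: 4-connected flood fill over nonzero pixels.

    Returns one (row_centroid, col_centroid, pixels) tuple per blob.
    """
    h, w = len(image), len(image[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for r in range(h):
        for c in range(w):
            if not image[r][c] or seen[r][c]:
                continue
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < h and 0 <= nx < w and image[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            cy = sum(p[0] for p in pixels) / len(pixels)
            cx = sum(p[1] for p in pixels) / len(pixels)
            blobs.append((cy, cx, pixels))
    return blobs


def stand_in_classifier(image: Image, pixels: List[Pixel]) -> str:
    """Hypothetical placeholder for the CNN: majority color code -> label."""
    names = {1: "red", 2: "green"}  # assumed label set, purely illustrative
    code = Counter(image[y][x] for y, x in pixels).most_common(1)[0][0]
    return names.get(code, "unknown")


def decode(image: Image,
           classify: Callable[[Image, List[Pixel]], str],
           row_tol: float = 1.5) -> List[List[str]]:
    """Deterministic stage: group labelled blobs into rows, then read left to right."""
    labelled = [(cy, cx, classify(image, px)) for cy, cx, px in find_blobs(image)]
    labelled.sort(key=lambda b: b[0])  # top to bottom
    rows, current = [], []
    for item in labelled:
        # Start a new row when the vertical gap to the previous blob is large.
        if current and item[0] - current[-1][0] > row_tol:
            rows.append(current)
            current = []
        current.append(item)
    if current:
        rows.append(current)
    return [[label for _, _, label in sorted(row, key=lambda b: b[1])] for row in rows]


# Toy 6x5 "photo": two rows of two candies each.
toy_image = [
    [1, 1, 0, 2, 2],
    [1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [2, 2, 0, 1, 1],
    [2, 2, 0, 1, 1],
]
tokens = decode(toy_image, stand_in_classifier)
```

The point of the split is that only `classify` needs to be learned. Swapping the placeholder for a real model shouldn't touch the blob detection or the row-ordering logic, so the part that turns tokens into a program stays fully deterministic and easy to debug.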