I wonder if the difficulty LLMs have with “seeing” complex detail in images is muddying the problem here. What if you hand it the cube state in text form? (You could try ASCII art if you want a middle ground.)
If you want to isolate the issue, try getting the LLM itself to turn the images into a text representation of the cube state, then check that transcription for accuracy. If it can’t see the state correctly, it certainly won’t be able to solve the cube.
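
For what it's worth, here is a rough sketch of how that accuracy check could look. It assumes the state is written as a 54-character string, one letter per sticker, nine stickers per face in row-major order, with faces in the order U, R, F, D, L, B. That encoding and the names `check_transcription` and `render_net` are my own choices for illustration, not anything from the original setup. You would paste the model's transcription in as `predicted` and the known scramble state as `truth`.

```python
from collections import Counter

# Assumed encoding: 54 letters, 9 per face, faces in this order.
FACES = "URFDLB"  # Up, Right, Front, Down, Left, Back


def check_transcription(predicted: str, truth: str) -> dict:
    """Compare a model's sticker string against the known cube state."""
    predicted = "".join(predicted.split()).upper()
    truth = "".join(truth.split()).upper()
    if len(truth) != 54:
        raise ValueError("ground truth must contain exactly 54 stickers")

    # A plausible cube has exactly nine stickers of each face colour.
    counts = Counter(predicted)
    well_formed = len(predicted) == 54 and all(counts[f] == 9 for f in FACES)

    # Positions that are missing or differ from the ground truth.
    wrong = [
        i for i in range(54)
        if i >= len(predicted) or predicted[i] != truth[i]
    ]
    return {
        "well_formed": well_formed,
        "wrong_positions": wrong,
        "accuracy": (54 - len(wrong)) / 54,
    }


def render_net(state: str) -> str:
    """Draw the state as an unfolded ASCII net (the 'middle ground' format)."""
    face = {f: state[i * 9:(i + 1) * 9] for i, f in enumerate(FACES)}

    def rows(f):
        return [face[f][r * 3:r * 3 + 3] for r in range(3)]

    lines = ["    " + row for row in rows("U")]
    for r in range(3):
        lines.append(" ".join(rows(f)[r] for f in "LFRB"))
    lines += ["    " + row for row in rows("D")]
    return "\n".join(lines)


if __name__ == "__main__":
    solved = "".join(f * 9 for f in FACES)
    print(render_net(solved))
    print(check_transcription(solved, solved))
```

The `well_formed` flag is only a cheap sanity check (right length, nine of each colour); it does not prove the state is reachable on a real cube. The list of wrong positions is the more useful part, since it shows whether the model misreads a few stickers or whole faces.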