The article says that the author wanted to have an analysis done, but ruled out the option of a human doing it. He did not rule out using a multimodal LLM. It's possible that he just is unaware that multimodal LLMs are capable of doing an analysis. Considering the approach wasn't mentioned in the article one can guess that the author just didn't know it was possible. But there is no way to know for sure.