The input is images and the output is CAD models, so it appears you could use a multi-modal LLM to chain natural language -> image -> CAD.
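
A rough sketch of what that chaining could look like, assuming you have some text-to-image step and some image-to-CAD step available as plain functions. Every name here (`make_text_to_cad`, `fake_text_to_image`, `fake_image_to_cad`) is hypothetical and stands in for whatever models you would actually call. None of it is a real library API.

```python
from typing import Callable

# Hypothetical stage signatures; neither corresponds to a real library API.
TextToImage = Callable[[str], bytes]  # prompt -> encoded image bytes
ImageToCad = Callable[[bytes], str]   # image bytes -> CAD model as text


def make_text_to_cad(
    text_to_image: TextToImage, image_to_cad: ImageToCad
) -> Callable[[str], str]:
    """Chain the two stages: natural language -> image -> CAD."""

    def text_to_cad(prompt: str) -> str:
        image = text_to_image(prompt)  # stage 1: multi-modal LLM (assumed)
        return image_to_cad(image)     # stage 2: image-to-CAD model (assumed)

    return text_to_cad


# Stand-in stubs so the sketch runs without any model.
def fake_text_to_image(prompt: str) -> bytes:
    return prompt.encode("utf-8")


def fake_image_to_cad(image: bytes) -> str:
    return f"cad_model_from({len(image)}_bytes)"


pipeline = make_text_to_cad(fake_text_to_image, fake_image_to_cad)
```

Calling `pipeline("a bracket")` runs both stubs in order. Swapping in real models would only mean replacing the two stub functions.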