This seems like a suitable job for a small language model. I'm a bit biased, though, since I just read this paper [0].
[0] https://research.nvidia.com/labs/lpr/slm-agents/