Looks awesome. I work closely with multimodal search and have had trouble porting CLIP to ONNX and other formats due to the lack of multi-head attention operators. Are you using Python for the CLIP inference, or did you manage to port it to a format hostable in a Rust or C/C++ inference runtime?

Yes, we were able to port the CLIP model to work with ONNX Runtime for inference.
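
For anyone hitting the same wall: below is a minimal sketch of one way such an export can look, assuming the Hugging Face `transformers` CLIP implementation (the thread doesn't say which implementation was actually used). The key point for the multi-head attention concern above is that `torch.onnx.export` traces `nn.MultiheadAttention` down to primitive matmul/softmax ops, so the missing ONNX MHA operator isn't a blocker.

```python
# Sketch only: assumes the Hugging Face transformers CLIP implementation,
# not necessarily what the original commenters used.
import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()

# Wrap the image tower so the exported graph takes pixel values only.
class ImageEncoder(torch.nn.Module):
    def __init__(self, clip):
        super().__init__()
        self.clip = clip

    def forward(self, pixel_values):
        return self.clip.get_image_features(pixel_values=pixel_values)

dummy = torch.randn(1, 3, 224, 224)  # ViT-B/32 expects 224x224 inputs
torch.onnx.export(
    ImageEncoder(model),
    dummy,
    "clip_image.onnx",
    input_names=["pixel_values"],
    output_names=["image_embeds"],
    dynamic_axes={"pixel_values": {0: "batch"}},
    # Attention is traced into matmul/softmax primitives, so a standard
    # opset suffices despite ONNX having no multi-head attention op.
    opset_version=14,
)
```

The text tower can be exported the same way via `get_text_features`; exporting the two towers separately keeps each graph simple and lets you host only the one you need.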

May I ask, which version of ORT are you using? Were the outputs identical to PyTorch outputs for the same image?
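
For reference, a self-contained sketch of the kind of parity check being asked about, assuming the `clip_image.onnx` file from the export sketch above and a hypothetical test image `cat.jpg`. Bit-identical outputs across runtimes are unlikely; agreement to roughly 1e-5 in fp32 is the usual expectation.

```python
# Sketch only: "clip_image.onnx" and "cat.jpg" are placeholder names.
import numpy as np
import torch
import onnxruntime as ort
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
pixel_values = processor(images=Image.open("cat.jpg"),
                         return_tensors="pt")["pixel_values"]

# Reference output from the PyTorch image tower.
with torch.no_grad():
    torch_out = model.get_image_features(pixel_values=pixel_values).numpy()

# Same preprocessed image through the exported ONNX graph.
sess = ort.InferenceSession("clip_image.onnx",
                            providers=["CPUExecutionProvider"])
onnx_out = sess.run(None, {"pixel_values": pixel_values.numpy()})[0]

print("max abs diff:", np.max(np.abs(torch_out - onnx_out)))
print("allclose:", np.allclose(torch_out, onnx_out, atol=1e-4))
```

Note that preprocessing must match exactly on both sides (resize, crop, normalization); feeding both runtimes the same preprocessed tensor, as here, isolates the model graphs themselves.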