> I'm still not sure what the best inference stack on Rust is
I was just looking into this today!
The options I've found so far, but have yet to evaluate:
- TorchScript + tch: use `torch.jit.trace` to create a traced model, then load it with tch and rust-tokenizers (see the sketch after this list)
- rust-bert + tch: seems to provide slightly higher-level usage; also uses a traced model
- ONNX Runtime: convert the .pt model (via transformers.onnx) to .onnx encoder and decoder, then use onnxruntime + ndarray for inference
- Candle crate: seems to have the smallest API for basic inference, and AFAIK it can load models saved with model.save() without a conversion step
These are the different approaches I've found so far, but I've probably missed a bunch. All of them seem OK, just at different abstraction levels obviously, so it depends on what you ultimately want. If anyone knows of any other approach, I'd be more than happy to hear about it!
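For context, here's roughly what the Rust side of the TorchScript + tch route looks like. This is just a sketch: `traced_model.pt` and the input shape are placeholders, and it assumes you've already run `torch.jit.trace(...).save(...)` on the Python side.

```rust
use tch::{CModule, Device, Kind, Tensor};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the TorchScript module produced by torch.jit.trace(...).save("traced_model.pt").
    let model = CModule::load("traced_model.pt")?;

    // Dummy input; replace the shape with whatever your traced model expects.
    let input = Tensor::randn(&[1, 3, 224, 224], (Kind::Float, Device::Cpu));

    // Run the traced forward pass.
    let output = model.forward_ts(&[input])?;
    println!("output shape: {:?}", output.size());
    Ok(())
}
```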
There's also the burn framework, but there are a lot of tradeoffs to consider. It's neat for wgpu targets (including the web), but you'll need to implement a lot of things yourself.
Candle is a great choice overall (and there are plenty of examples) but performance is slightly worse compared to tch.
Personally, if I can get it done with candle, that's what I do (a minimal sketch is below). It's also pretty neat for serverless.
If I can't, I check whether I can convert the model to ONNX without extra work (or whether an ONNX export is already available).
As a last resort, I consider shipping libtorch via tch.
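To give a feel for how small the candle surface is, here's a minimal sketch (the matmul is basically the candle hello-world; `model.safetensors` is just a placeholder filename):

```rust
use candle_core::{Device, Tensor};

fn main() -> candle_core::Result<()> {
    let device = Device::Cpu;

    // Two random matrices and a matmul.
    let a = Tensor::randn(0f32, 1.0, (2, 3), &device)?;
    let b = Tensor::randn(0f32, 1.0, (3, 4), &device)?;
    let c = a.matmul(&b)?;
    println!("{c}");

    // Weights saved as safetensors load directly into a name -> tensor map,
    // which you then feed into whatever model struct you've built.
    let weights = candle_core::safetensors::load("model.safetensors", &device)?;
    println!("loaded {} tensors", weights.len());
    Ok(())
}
```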
Great resources, thanks. I'll look into the other packages and compare against our onnx runtime setup.