The repository already runs an ONNX model. But the ONNX model doesn't take English text as input; it takes tokenized phonemes. The preprocessing for that step is where most of the dependencies come from.
Which is completely reasonable imho, but obviously comes with tradeoffs.
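For context, the runtime pipeline looks roughly like this. This is only a sketch; the input name and the phoneme-to-ID mapping are illustrative, not the repo's actual API. The point is that the phonemizer, not the ONNX runtime, is the heavy part:

```python
# Rough sketch of a typical text-to-speech runtime path (names like "input"
# and phoneme_to_id are illustrative, not the repo's actual API).
import onnxruntime as ort
from phonemizer import phonemize  # wraps espeak-ng etc.; the heavy dependency

def speak(text, session, phoneme_to_id):
    phonemes = phonemize(text, language="en-us")            # text -> IPA string
    ids = [phoneme_to_id[p] for p in phonemes if p in phoneme_to_id]
    return session.run(None, {"input": [ids]})[0]           # token IDs -> audio

session = ort.InferenceSession("model.onnx")
```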
For space-sensitive applications like embedded systems, could you shift the preprocessing to compile time?
You would need to constrain the vocabulary to see any benefits, but that could be reasonable. For example, an enumeration of numbers, units, and metric names could handle dynamic time, temperature, and other dashboard items.
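A minimal sketch of that compile-time step, assuming a phonemizer toolchain on the build machine and an invented header format; the device only ships the generated tables:

```python
# Build-time sketch: phonemize a fixed vocabulary once on the dev machine and
# bake the token IDs into a generated header, so the device never ships the
# phonemizer. The vocabulary and header format here are made up.
from phonemizer import phonemize

VOCAB = ["zero", "one", "two", "minus", "degrees", "kilometers", "per", "hour"]

def to_ids(word, phoneme_to_id):
    return [phoneme_to_id[p] for p in phonemize(word, language="en-us")
            if p in phoneme_to_id]

def emit_header(vocab, phoneme_to_id, path="dashboard_tokens.h"):
    with open(path, "w") as f:
        for word in vocab:
            ids = ", ".join(map(str, to_ids(word, phoneme_to_id)))
            f.write(f"static const int TOK_{word.upper()}[] = {{{ids}}};  /* \"{word}\" */\n")
```

At runtime you would just concatenate the relevant arrays and feed them to the model.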
For something more complex like offline navigation, you already need to store a map. You could store street names as tokens instead of text. Add a few turn commands, and you have offline spoken directions without on-device preprocessing.
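A sketch of that idea, with placeholder token values: the map build step stores each street name as a pre-tokenized ID sequence, and the only on-device work left is concatenation.

```python
# Sketch of offline spoken directions from pre-tokenized pieces.
# All token values below are placeholders, not real model IDs.
TURN_LEFT_ONTO  = [12, 4, 53, 7, 19]    # pre-tokenized "turn left onto"
TURN_RIGHT_ONTO = [12, 4, 61, 7, 19]    # pre-tokenized "turn right onto"

streets = {                              # would come from the offline map data
    "Main Street": [33, 8, 41, 22],
    "Oak Avenue":  [15, 9, 27, 30],
}

def direction(turn, street):
    return turn + streets[street]        # token IDs, ready for the ONNX model

ids = direction(TURN_LEFT_ONTO, "Main Street")
```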