> Xorshift128+ etc are around 10 to 30 times faster than ChaCha20.

What methods, what CPU? Is that using ChaCha20 a couple of bytes at a time? If you generate your random bytes in medium-sized blocks, you'll probably see a much smaller difference.
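To make "medium-sized blocks" concrete, here's a rough sketch of the buffering I mean: call into the generator once per block and hand out small requests from the buffer. The names (`BufferedRandom`, `make_stand_in_stream`) and the 4096-byte block size are made up for illustration, and the block source is a BLAKE2b-in-counter-mode stand-in, not ChaCha20, so it says nothing about actual speed. It only shows how the per-call overhead gets amortized.

```python
import hashlib


class BufferedRandom:
    """Serve small requests from a buffer that is refilled one block at a time."""

    def __init__(self, fill_block, block_size=4096):
        self._fill_block = fill_block  # callable: n -> n bytes from the generator
        self._block_size = block_size  # assumed block size, tune for your workload
        self._buf = b""
        self._pos = 0
        self.refills = 0               # number of calls into the underlying generator

    def read(self, n):
        out = bytearray()
        while len(out) < n:
            if self._pos == len(self._buf):
                # Buffer exhausted: generate one whole block in a single call.
                self._buf = self._fill_block(self._block_size)
                self._pos = 0
                self.refills += 1
            take = min(n - len(out), len(self._buf) - self._pos)
            out += self._buf[self._pos:self._pos + take]
            self._pos += take
        return bytes(out)


def make_stand_in_stream(seed):
    """Hypothetical stand-in for a ChaCha20 keystream: keyed BLAKE2b over a counter."""
    counter = 0

    def fill(n):
        nonlocal counter
        chunks = []
        size = 0
        while size < n:
            digest = hashlib.blake2b(counter.to_bytes(8, "little"), key=seed).digest()
            counter += 1
            chunks.append(digest)
            size += len(digest)
        return b"".join(chunks)[:n]

    return fill


rng = BufferedRandom(make_stand_in_stream(b"example seed"))
words = [rng.read(4) for _ in range(1024)]  # 1024 four-byte requests
print(rng.refills)                          # calls into the generator so far
```

With a real ChaCha20 implementation plugged in as `fill_block`, a benchmark that asks for 4 bytes at a time pays the setup cost once per 4096 bytes instead of once per request, which is where I'd expect most of the gap to close.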