the performance was pretty much identical, however, as an added benefit, Blake2 got quite a bit faster due a combination of 1) our code being slightly more optimized, and 2) python's blake2 integration not actually doing runtime cpu detection, meaning that unless you compiled with -march=native (like, Gentoo), at the time, you were not getting the AVX2 version of Blake2 within Python -- my PR fixed that and added code for CPU autodetection
bear in mind that performance is tricky -- see my comment about NEON https://github.com/python/cpython/pull/119316#issuecomment-2...