This is very interesting. I'm a bit skeptical about the benchmarks / performance claims because they seem almost too good to be true but even just the extended operators alone are a nice improvement over existing regex engines.
The post mentions they also have a native library implemented in Rust without dependencies but I couldn't find a link to it. Is that available somewhere? I would love to try it out in some of my projects but I don't use .NET so the NuGET package is of no use to me.
There's currently only a string solver with the same core library, but not a full regex engine https://github.com/ieviev/cav25-resharp-smt
I will open source the rust engine soon as well, some time this month.
As for the benchmarks, it's the fastest for large patterns and lookarounds, where leftmost-longest lets you get away with less memory usage so we don't need to transition from DFA to NFA.
In the github readme benchmarks it's faster than the exponential implementations of .NET Compiled so the 35 000x can be an arbitrary multiplier, you can keep adding alternatives and make it 1000000x.
for a small set of string literals it will definitely lose to Hyperscan and Rust regex since they have a high effort left-to-right SIMD algorithm that we cannot easily use.
> for simple string literals it will definitely lose to Hyperscan and Rust regex since they have a high effort left-to-right SIMD algorithm that we cannot easily use
I think "simple string literals" undersells it. I think that description works for engines like RE2 or Go's regex engine, but not Hyperscan or Rust regex. (And I would put Hyperscan in another category than even Rust regex.) Granted, it is arguably difficult to be succinct here since it's a heuristic with difficult-to-predict failure points. But something like: "patterns from which a small number of string literals can be extracted."
yes, that is correct. also Rust's engine matches the full unicode spec as individual characters, whereas .NET's will chop emojis into two sometimes, so Rust at a disadvantage here.
something i've been also wondering is how does Harry (https://ieeexplore.ieee.org/document/10229022) compare to the Teddy algorithm, it's written by some of the same authors - i wonder if it's used in any engines outside of Hyperscan today.
Would SearchValues<char> help there for a fallback to a SIMD optimized simple string literal search rather than the happy path?
Yes, that's exactly what we did to be competitive in the benchmarks.
There's a lot of simple cases where you don't really need a regex engine at all.
integrating SearchValues as a multi-string prefix search is a bit harder since it doesn't expose which branch matched so we would be taking unnecessary steps.
Also .NET implementation of Hyperscan's Teddy algorithm only goes left to right.. if it went right to left it would make RE# much faster for these cases.
So, there's still room for significant improvement.
There is plenty still to do.
One part of this is SIMD algorithms to better compete with Hyperscan/Rust, another is the decades of optimizations that backtracking engines have for short anchored matches for validation.
There's analysis to do for specific patterns so we can opt for specialized algorithms, eg. for fixed length patterns we skip the left-to-right pass entirely since we already know the match start + match length.
Lots of opportunistic things like this which we haven't done. Also there are no statistical optimizations in the engine right now. Most engines will immediately start looking for a 'z' if there is one in the pattern since it is rare.
I second this request. It would be wonderful to be able to test the rust implementation since it easier to call from other languages in my typical setup. I have a couple uses cases I have never fully resolved, just implemented partial work arounds and accepted a restricted feature set. This would probably allow me to deal with them correctly.