The README.md is 9 kB of dense text, but it does explain why: faster, more efficient, more accurate, and more sensible.
Rust port feature: The implementation "passes 93.8% of Mozilla's test suite (122/130 tests)" with full document preprocessing support.
Test interpretation/sensibility: The 8 failing tests "represent editorial judgment differences rather than implementation errors." It notes four cases involving "more sensible choices in our implementation such as avoiding bylines extracted from related article sidebars and preferring author names over timestamps."
This means that the results are 93.8% identical, and the remaining differences are arguably an improvement.
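To make the "author names over timestamps" point concrete, here is a toy sketch (my own illustration, not the port's actual code) of how a byline picker might skip timestamp-like candidates:

```rust
// Hypothetical heuristic, not the library's real logic: treat a candidate
// as a timestamp if more than half of its non-whitespace characters are
// digits or date punctuation.
fn looks_like_timestamp(s: &str) -> bool {
    let relevant: Vec<char> = s.chars().filter(|c| !c.is_whitespace()).collect();
    if relevant.is_empty() {
        return false;
    }
    let datey = relevant
        .iter()
        .filter(|c| c.is_ascii_digit() || matches!(c, '-' | '/' | ':' | ','))
        .count();
    datey * 2 > relevant.len()
}

// Prefer the first candidate that does not look like a timestamp,
// falling back to whatever is available.
fn pick_byline<'a>(candidates: &[&'a str]) -> Option<&'a str> {
    candidates
        .iter()
        .copied()
        .find(|c| !looks_like_timestamp(c))
        .or_else(|| candidates.first().copied())
}

fn main() {
    let candidates = ["2024-05-01 12:30", "Jane Doe"];
    // The author name wins over the timestamp-like string.
    assert_eq!(pick_byline(&candidates), Some("Jane Doe"));
}
```

A real implementation would of course work on DOM nodes and site-specific markup rather than bare strings, but the preference order is the point.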
A further improvement in extraction accuracy: document preprocessing "improves extraction accuracy by 2.3 percentage points compared to parsing raw HTML."
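For intuition about what a preprocessing pass might look like, here is a minimal sketch (purely hypothetical, not the crate's actual pipeline) that strips `<script>` blocks before content scoring, so script bodies can't pollute the extraction:

```rust
// Toy preprocessing step (an assumption on my part, not the port's real
// code): remove <script>...</script> blocks via plain string scanning.
fn strip_scripts(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut rest = html;
    while let Some(start) = rest.find("<script") {
        out.push_str(&rest[..start]);
        match rest[start..].find("</script>") {
            Some(end) => rest = &rest[start + end + "</script>".len()..],
            None => return out, // unterminated script: drop the tail
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    let html = "<p>Article</p><script>track()</script><p>Body</p>";
    assert_eq!(strip_scripts(html), "<p>Article</p><p>Body</p>");
}
```

A production pipeline would do this on a parsed DOM rather than raw strings, but it shows the kind of cleanup that can happen before Readability's scoring runs.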
Performance:
* Built in Rust for performance and memory safety
* Per the README, "Zero-cost abstractions enable optimizations without runtime overhead."
* It also claims "Minimal allocations during parsing through efficient string handling and DOM traversal."
* The library "processes typical news articles in milliseconds on modern hardware."
It's not explicitly stated, but I think it's reasonable to assume from these four points that its "millisecond" processing time is significantly faster than the original JavaScript implementation. Perhaps it's also better memory-wise.
I would add a comparison benchmark (memory and processing time), perhaps with bar charts to make it clearer for people who scan-read, along with the 8 examples of differing editorial judgement.
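A minimal timing harness for such a comparison could look like the sketch below. The two extractors here are placeholders (assumptions on my part); a real benchmark would run the Rust port and the original readability.js against the same corpus and also record peak memory:

```rust
use std::time::Instant;

// Placeholder "extractors" standing in for the two implementations
// under comparison; both just count <p> tags so the harness is runnable.
fn extract_a(html: &str) -> usize {
    html.matches("<p>").count()
}

fn extract_b(html: &str) -> usize {
    html.split("<p>").count().saturating_sub(1)
}

// Time `iters` runs of an extractor and report milliseconds per iteration.
fn time_it<F: Fn(&str) -> usize>(f: F, input: &str, iters: u32) -> f64 {
    let start = Instant::now();
    let mut sink = 0usize;
    for _ in 0..iters {
        sink = sink.wrapping_add(f(input));
    }
    let elapsed = start.elapsed().as_secs_f64();
    // Use the accumulated result so the work is not optimized away.
    assert!(sink < usize::MAX);
    elapsed * 1000.0 / iters as f64
}

fn main() {
    let doc = "<p>lorem</p>".repeat(10_000);
    let a_ms = time_it(extract_a, &doc, 50);
    let b_ms = time_it(extract_b, &doc, 50);
    println!("extract_a: {a_ms:.4} ms/iter, extract_b: {b_ms:.4} ms/iter");
}
```

For publishable numbers you'd want a proper harness such as Criterion.rs (warm-up, statistical analysis) rather than a single `Instant` loop, plus the bar charts generated from its output.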
So why? The link just goes to the project GitHub repo, and README does not explain as far as I can see.