Because it's far more reliable to use proper parsers instead of a bunch of regular expressions. Most programming languages simply aren't regular, so they can't be correctly parsed with regexes.
Those files are compiled tree-sitter grammars; read up on why tree-sitter exists and where it's used rather than have me poorly regurgitate the official documentation:
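For a quick taste of what those grammars buy you, the tree-sitter CLI can dump an actual parse tree (assuming you have the CLI and the JavaScript grammar set up; byte ranges trimmed from the real output):

```
$ echo 'if (x) { f(); }' > demo.js
$ tree-sitter parse demo.js
(program
  (if_statement
    condition: (parenthesized_expression (identifier))
    consequence: (statement_block
      (expression_statement
        (call_expression function: (identifier) arguments: (arguments))))))
```

No regex stack is going to give you that nesting reliably, which is the whole point.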
Funnily enough, they're less than 10MB when compressed. I guess they could use something like upx to compress these binaries.
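Something like this, I'd guess (binary name is just a placeholder):

```
$ upx --best --lzma ./editor
```

Worth noting that upx-packed executables unpack themselves in memory on every launch, so this trades startup time for disk space.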
The whole Linux release is 15MB, but it unpacks to a 16MB binary and 200MB of grammars on disk.
Why do we need to have 40MB of Verilog grammars on disk when 99% of people don't use them?
That would waste CPU time and introduce additional delays when opening files.
They could probably lazily install the grammars like neovim does, but as someone who doesn't have much faith in the reliability of internet infrastructure, I'll personally take the everything-bundled-up-front approach...
Just ran `:TSInstall all` in neovim out of curiosity, and the results were predictable:
If disk space is important for your use case, I guess filesystem compression would save far more than just compressing the binaries with upx. btrfs+zstd handle those .so files well:

I mean, they could decompress it once when a language is first used. It would still be fully offline, just with a bit of decompression up front.
If this is a concern, why not compress at the filesystem level?
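For anyone who wants to try it, here's roughly what that looks like on btrfs (device and paths are illustrative; compsize reports the actual on-disk ratio):

```
# mount with transparent zstd compression for all new writes
$ sudo mount -o compress=zstd:3 /dev/sdX2 /mnt

# or opt in a single directory, e.g. wherever the grammars live
$ sudo btrfs property set ~/.local/share/editor/grammars compression zstd

# re-compress existing files and check the result
$ sudo btrfs filesystem defragment -r -czstd ~/.local/share/editor/grammars
$ sudo compsize ~/.local/share/editor/grammars
```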
For real parsing, a proper compiler frontend (via a language server implementation) should be used. Writing something by hand can't work properly, especially with languages like C++ and Rust and their complex includes/imports and macros. Newer LSP revisions support syntax-aware highlighting (semantic tokens, since 3.16), but if a particular LSP implementation doesn't support it, a regex-based fallback is mostly fine.
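For reference, that's the `textDocument/semanticTokens/full` request; the server's answer is just a flat integer array, five numbers per token. A rough sketch of the wire format (values invented for illustration):

```
// request params
{ "textDocument": { "uri": "file:///tmp/example.rs" } }

// response: each token is
// [deltaLine, deltaStartChar, length, tokenType, tokenModifiers]
{ "data": [0, 3, 4, 1, 0,
           1, 2, 7, 0, 2] }
```

The token type and modifier indices map into a legend the server advertises at initialization, so the editor still decides the actual colors.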