Because it's far more reliable to use proper parsers instead of a bunch of regular expressions. Most programming languages simply aren't regular, so they can't be correctly parsed with regexes.
Those files are compiled tree-sitter grammars; read up on why tree-sitter exists and where it's used rather than have me poorly regurgitate the official documentation:
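For a quick taste of what those grammars buy you, the tree-sitter CLI can dump an actual parse tree (assuming you have the CLI and the JavaScript grammar set up; byte ranges trimmed from the real output):

```
$ echo 'if (x) { f(); }' > demo.js
$ tree-sitter parse demo.js
(program
  (if_statement
    condition: (parenthesized_expression (identifier))
    consequence: (statement_block
      (expression_statement
        (call_expression function: (identifier) arguments: (arguments))))))
```

No regex stack is going to give you that nesting reliably, which is the whole point.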
Funnily enough, they're less than 10MB when compressed. I guess they could use something like upx to compress these binaries.
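Something like this, I'd guess (binary name is just a placeholder):

```
$ upx --best --lzma ./editor
```

Worth noting that upx-packed executables unpack themselves in memory on every launch, so this trades startup time for disk space.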
The whole Linux release is 15MB, but it unpacks to a 16MB binary and 200MB of grammars on disk.
Why do we need to have 40MB of Verilog grammars on disk when 99% of people don't use them?
That would waste CPU time and introduce additional delays when opening files.
They could probably lazily install the grammars like neovim does, but as someone who doesn't have much faith in the reliability of internet infrastructure, I'll personally take the everything-bundled-up-front approach...
Just ran `:TSInstall all` in neovim out of curiosity, and the results were predictable:
If disk space is important for your use case, I guess filesystem compression would save far more than just compressing the binaries with upx. btrfs+zstd handle those .so files well:

I mean, they could decompress it once when a language is first used. It would still be fully offline, just with a bit of decompression up front.
If this is a concern, why not compress at the filesystem level?
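For anyone who wants to try it, here's roughly what that looks like on btrfs (device and paths are illustrative; compsize reports the actual on-disk ratio):

```
# mount with transparent zstd compression for all new writes
$ sudo mount -o compress=zstd:3 /dev/sdX2 /mnt

# or opt in a single directory, e.g. wherever the grammars live
$ sudo btrfs property set ~/.local/share/editor/grammars compression zstd

# re-compress existing files and check the result
$ sudo btrfs filesystem defragment -r -czstd ~/.local/share/editor/grammars
$ sudo compsize ~/.local/share/editor/grammars
```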
For real parsing, a proper compiler frontend (via a language server implementation) should be used. Writing something by hand can't work properly, especially with languages like C++ and Rust and their complex includes/imports and macros. Newer LSP revisions support syntax-aware highlighting (semantic tokens, since 3.16), but if a particular LSP implementation doesn't support it, a regex-based fallback is mostly fine.
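For reference, that's the `textDocument/semanticTokens/full` request; the server's answer is just a flat integer array, five numbers per token. A rough sketch of the wire format (values invented for illustration):

```
// request params
{ "textDocument": { "uri": "file:///tmp/example.rs" } }

// response: each token is
// [deltaLine, deltaStartChar, length, tokenType, tokenModifiers]
{ "data": [0, 3, 4, 1, 0,
           1, 2, 7, 0, 2] }
```

The token type and modifier indices map into a legend the server advertises at initialization, so the editor still decides the actual colors.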