Hacker News

I assume the grep compatible bit is attractive to some people. Not me, but they exist.

I find myself returning to grep from my default of rg because I'm just too lazy to learn a new regex language. Stuff like word boundaries "\<word\>" or multiple patterns "\(one\|two\)".

masklinn 2 years ago [ - ]

That seems like the weirdest take ever: ripgrep uses pretty standard PCRE patterns, which are a lot more common than posix’s bre monstrosity.

To me the regex langage is very much a reason to not use grep.

derriz 2 years ago [ - ]

A bit hyperbolic, no?

If you consider it "the weirdest ever", I'm guessing that I'm probably older than you. I've certainly been using regex long before PCRE became common.

As a vim user I compose 10s if not 100s of regexes a day. It does not use PCRE. Nor does sed, a tool I've been using for decades. Do you also recommend not using these?

comex 2 years ago [ - ]

I use all of those tools but the inconsistency drives me crazy as it's hard to remember which syntax to use where. Here's how to match the end of a word:

ripgrep, Python, JavaScript, and practically every other non-C language: \b

vim: \>

BSD sed: [[:>:]]

GNU sed, GNU grep: \> or \b

BSD grep: \>, \b, or [[:>:]]

less: depends on the OS it's running on

burntsushi 2 years ago [ - ]

Did you know that not all of those use the same definition of what a "word" character is? Regex engines differ on the inclusion of things like \p{Join_Control}, \p{Mark} and \p{Connector_Puncuation}. Although in the case of \p{Connector_Punctuation}, regex engines will usually at least include underscore. See: https://github.com/BurntSushi/rebar/blob/f9a4f5c9efda069e798...

And then there's \p{Letter}. It can be spelled in a lot of ways: \pL, \p{L}, \p{Letter}, \p{gc=Letter}, \p{gc:Letter}, \p{LeTtEr}. All equivalent. Very few regex engines support all of them. Several support \p{L} but not \pL. See: https://github.com/BurntSushi/rebar/blob/f9a4f5c9efda069e798...

pbhjpbhj 2 years ago [ - ]

`pgrep`, or `grep -P`, uses PCRE though, AFAIUI.

2 years ago [ - ]

[deleted]

burntsushi 2 years ago [ - ]

ripgrep's regex syntax is pretty similar to grep -E. So if you know grep -E, most of that will transfer over.

Also, \< and \> are in ripgrep 14. Although you usually just want to use the -w/--word-regexp flag.

xoranth 2 years ago [ - ]

> Also, \< and \> are in ripgrep 14

Isn't that inconsistent with the way Perl's regex syntax was designed? In Perl's syntax an escaped non-ASCII character is always a literal [^1], and that is guaranteed not to change.

That's nice for beginners because it saves you from having to memorize all the metacharacters. If you are in doubt you on whether something has a special meaning, you just escape it.

[^1]: https://perldoc.perl.org/perlrebackslash#The-backslash

burntsushi 2 years ago [ - ]

Yes, it's inconsistent with Perl. But there are many things in ripgrep's default regex engine that are inconsistent with Perl, including the fact that all patterns are guaranteed to finish a search in linear time with respect to the haystack. (So no look-around or back-references are supported.) It is a non-goal of ripgrep to be consistent with Perl. Thankfully, if you want that, then you can get pretty close by passing the -P/--pcre2 flag.

With that said, I do like Perl's philosophy here. And it was my philosophy too up until recently. I decided to make an exception for \< and \> given their prevalence.

It was also only relatively recently that I made it possible for superfluous escapes to exist. Prior to ripgrep 14, unrecognized escapes were forbidden:

    $ echo '@' | rg-13.0.0 '\@'
    regex parse error:
        \@
        ^^
    error: unrecognized escape sequence
    $ echo '@' | rg '\@'
    @

I had done it this way to make it possible to add new escape sequences in a semver compatible release. But in reality, if I were to ever add new escape sequences, it use one of the ascii alpha-numeric characters, as Perl does. So I decided it was okay to forever and always give up the ability to make, e.g., `\@` mean something other than just matching a literal `@`.

`\<` and `\>` are forever and always the lone exceptions to this. It is perhaps a trap for beginners, but there are many traps in regexes, and this seemed worth it.

Note that `\b{start}` and `\b{end}` also exist and are aliases for `\<` and `\>`. The more niche `\b{start-half}` and `\b{end-half}` also exist, and those are what are used to implement the -w/--word-regexp flag. (Their semantics match GNU grep's -w/--word-regexp.) For example, `\b-2\b` will not match in `foo -2 bar` since `-` is not a word character and `\b` demands `\w` on one side and `\W` on the other. However, `rg -w -e -2` will match `-2` in `foo -2 bar`:

    $ echo 'foo -2 bar' | rg -w -e '\b-2\b'
    $ echo 'foo -2 bar' | rg -w -e -2
    foo -2 bar

xoranth 2 years ago [ - ]

Ok, makes sense. And thanks for the detailed explaination about word boundaries and the hint about the --pcre flag (I hadn't realized it existed).