The middle approach is the only one that is not lock-free.

The first approach is lock-free, but as the author says, it wastes an element.

But here's the thing. If your element is a character, and your buffer size is, say, 256 bytes, and you are using 8-bit unsigned characters for indices, the one wasted byte is less than one percent of your buffer space, and also is compensated for by the simplicity and reduced code size.

I've used the "Waste an element" one for ages on microcontrollers where I don't want to deal with the overhead in an ISR.

Agreed.

The article author claims that the "don't waste an element" code is also more efficient, but that claim seems to be based on a hard-on about the post-increment operator, rather than any kind of dive into the cyclometric complexity, or even, y'know, just looking at the assembler output from the compiler.