I like to have my lexers operate on `FILE*` rather than string-views. This has some real-world performance implications (not good ones), but it does mean I can operate on streams. If the user has a C string, it can easily be wrapped by `funopen()` or `fopencookie()` to provide a `FILE*` adapter layer. (Most of my lexers include one of these, out of the box.)
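For concreteness, the adapter layer looks roughly like this on glibc with `fopencookie()` (a sketch only; the cookie struct and `file_from_cstring` are made-up names, and the `funopen()` version on a BSD is analogous):

```c
#define _GNU_SOURCE          /* fopencookie() is a glibc extension */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/types.h>

/* Cookie: the string being read and the current read position. */
struct strcookie { const char *s; size_t pos, len; };

static ssize_t str_read(void *c, char *buf, size_t n) {
    struct strcookie *sc = c;
    size_t left = sc->len - sc->pos;
    if (n > left) n = left;
    memcpy(buf, sc->s + sc->pos, n);
    sc->pos += n;
    return (ssize_t)n;
}

static int str_close(void *c) { free(c); return 0; }

/* Wrap a NUL-terminated string as a read-only FILE*. */
FILE *file_from_cstring(const char *s) {
    struct strcookie *sc = malloc(sizeof *sc);
    if (!sc) return NULL;
    sc->s = s; sc->pos = 0; sc->len = strlen(s);
    cookie_io_functions_t io = { .read = str_read, .close = str_close };
    return fopencookie(sc, "r", io);
}
```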

Everything else I stole from Bob Nystrom: I keep a local copy of the token's string in the token itself, i.e., `char word[64]`. I try to minimize "decision making" during lexing. Really, at the consumption point we're only interested in an extremely small number of things: (1) does the lexeme start with a letter or a number?; (2) is it whitespace, and is that whitespace a newline?; or (3) does it look like an operator?
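In code, that works out to something like this (a sketch; the field and enum names are invented, not pulled from Nystrom):

```c
#include <ctype.h>
#include <stdio.h>

/* The lexeme is copied into the token itself, so nothing
 * in the token points back into the input buffer. */
typedef enum { TOK_WORD, TOK_NUMBER, TOK_SPACE, TOK_NEWLINE, TOK_OPERATOR, TOK_EOF } tok_kind;

typedef struct {
    tok_kind kind;
    char     word[64];   /* local copy of the lexeme */
    int      line, col;  /* coordinates, not pointers */
} token;

/* The only decisions made at the consumption point. */
static tok_kind classify(int c) {
    if (c == EOF)               return TOK_EOF;
    if (c == '\n')              return TOK_NEWLINE;
    if (isspace(c))             return TOK_SPACE;     /* caller usually skips these */
    if (isalpha(c) || c == '_') return TOK_WORD;
    if (isdigit(c))             return TOK_NUMBER;
    return TOK_OPERATOR;        /* "does it look like an operator?" */
}
```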

The only place where I've ever considered goto-threading was in keyword identification. However, if your language keeps keywords to ≤ 8 bytes, you can just bake the keywords into `uint64_t`s and compare against those values. You can do a crapload of 64-bit compares per nanosecond.
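Something like this is what I mean by baking keywords into `uint64_t`s (a sketch; `pack8` and the keyword list are illustrative):

```c
#include <stdint.h>
#include <string.h>

/* Pack a lexeme of up to 8 bytes into a uint64_t (zero-padded). */
static uint64_t pack8(const char *s, size_t n) {
    uint64_t v = 0;
    if (n > 8) return UINT64_MAX;   /* too long to be a short keyword; ASCII
                                       never packs to all-0xFF, so no collision */
    memcpy(&v, s, n);               /* byte order doesn't matter as long as   */
    return v;                       /* both sides are packed the same way     */
}

/* Keyword check: one 64-bit compare per keyword. */
static int is_keyword(const char *s, size_t n) {
    uint64_t v = pack8(s, n);
    return v == pack8("if", 2)
        || v == pack8("else", 4)
        || v == pack8("while", 5)
        || v == pack8("return", 6);
}
```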

The next level up (parsing) is slow enough to eat & memoize the decision making of the lexer; and, materially, it doesn't complicate the parser. (In fact: there's a lot of decision making that happens in the parser that'd have to be replicated in the lexer, otherwise.)

The result, overall, is that you can have a pretty general-purpose lexer that you can reuse for any old C-ish language and tune to your heart's content, without needing a custom rewrite each time.

> I like to have my lexers operate on `FILE*` rather than string-views. [...] it does mean I can operate on streams.

While I understand the desire to support one input interface for composability, reuse, etc., I can't help wondering why `FILE*`. Isn't reading from a string more "universal"?

> If the user has a C string, it can easily be wrapped by `funopen()` or `fopencookie()` to provide a `FILE*` adapter layer.

And if the user has a file, it's easy to read it into memory in advance.
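E.g., a throwaway slurp-the-file helper (a sketch; error handling kept minimal):

```c
#include <stdio.h>
#include <stdlib.h>

/* Read an entire file into a NUL-terminated buffer (caller frees). */
static char *slurp(const char *path, size_t *len_out) {
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long len = ftell(f);
    if (len < 0) { fclose(f); return NULL; }
    rewind(f);
    char *buf = malloc((size_t)len + 1);
    if (buf && fread(buf, 1, (size_t)len, f) != (size_t)len) { free(buf); buf = NULL; }
    if (buf) { buf[len] = '\0'; if (len_out) *len_out = (size_t)len; }
    fclose(f);
    return buf;
}
```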

What's the benefit of `FILE*` over a string?

Have you considered making your lexer operate in push mode instead?

This does mean you have to worry about partial tokens ... but if you limit yourself to feeding full lines, that mostly goes away.
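Roughly the shape I have in mind (a toy sketch that only splits on blanks; the names are made up):

```c
#include <stddef.h>

/* Push-mode interface: the caller feeds whole lines and receives tokens
 * through a callback, so partial tokens never straddle a feed boundary. */
typedef struct { const char *start; size_t len; int line; } token_view;
typedef void (*token_sink)(const token_view *tok, void *user);

typedef struct {
    int        line;   /* current line number */
    token_sink sink;
    void      *user;
} push_lexer;

void push_lexer_init(push_lexer *lx, token_sink sink, void *user) {
    lx->line = 1; lx->sink = sink; lx->user = user;
}

/* Feed one complete line; nothing else of the input needs to exist yet. */
void push_lexer_feed_line(push_lexer *lx, const char *line, size_t len) {
    size_t i = 0;
    while (i < len) {
        while (i < len && (line[i] == ' ' || line[i] == '\t')) i++;  /* skip blanks */
        size_t start = i;
        while (i < len && line[i] != ' ' && line[i] != '\t') i++;    /* grab a word */
        if (i > start) {
            token_view t = { line + start, i - start, lx->line };
            lx->sink(&t, lx->user);
        }
    }
    lx->line++;
}
```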

Besides, for reasonable-size workloads, "read the whole file ahead of time" is usually a win. The only time it's tempting not to do so is for REPLs.

I agree. But I also like the discipline of lexing from a `FILE*`. I've ended up with cleaner separation of concerns throughout the front-end stack, because I can't dip back into the well unless I'm thinking very clearly about that operation. For instance, I keep around coordinates of things rather than pointers, etc.
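By "coordinates" I mean something like this, rather than a `char *` into a buffer that may be gone by the time I need it (field names invented):

```c
/* Enough to re-find or report a lexeme later, without holding a
 * pointer into any particular buffer. */
typedef struct {
    long offset;   /* byte offset from the start of the stream (ftell-style) */
    int  line;
    int  col;
    int  len;      /* length of the lexeme */
} src_span;
```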

I'd do this in almost any other language than C :)

In C, I like just passing a `const char *` around as input; this also gives me the ability to return progress and unget chars as an added bonus.

https://github.com/codr7/shi-c/blob/b1d5cb718b7eb166a0a93c77...
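Roughly the pattern I mean, not the actual code behind that link (a toy sketch; `lex_word` is a made-up name):

```c
#include <ctype.h>
#include <stddef.h>

/* Take the input cursor, write the lexeme into out, and return the new
 * cursor so the caller can see progress. "Ungetting" is just the caller
 * keeping its old pointer instead of the returned one. */
const char *lex_word(const char *in, char *out, size_t outsz) {
    size_t n = 0;
    while (isspace((unsigned char)*in)) in++;            /* skip leading blanks */
    while ((isalnum((unsigned char)*in) || *in == '_')   /* copy the identifier */
           && n + 1 < outsz)
        out[n++] = *in++;
    if (outsz) out[n] = '\0';
    return in;
}
```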

The tragic thing is that you can't do `fgetwc()` on a `FILE *` produced by `fopencookie()` on Linux. glibc will crash your program deliberately as soon as there is a non-ASCII char in that stream (because, reasons?). But it does work with `funopen()` on a BSD, like macOS. I'm using that to read wide characters from UTF-8 streams.

Wide characters are best avoided even on platforms where they don't mean UTF-16. It's better to stay in UTF-8 and only verify that it's well-formed.

But at some point you'll want to know whether the code point you just read satisfies `iswalpha()` or whatever, so you'll have to decode UTF-8 anyway.
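Hand-rolling that decode isn't much code, though. A sketch of a validating decoder (made-up name; rejects overlongs, surrogates, and out-of-range values), after which you can hand the code point to `iswalpha()` on platforms where `wchar_t` holds a full code point:

```c
#include <stdint.h>
#include <stddef.h>

/* Decode one UTF-8 code point from s[0..n). Returns the number of bytes
 * consumed and stores the code point in *cp, or returns 0 if the sequence
 * is malformed. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp) {
    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }          /* ASCII fast path */
    size_t len; uint32_t c, min;
    if      ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;                                       /* stray continuation byte */
    if (n < len) return 0;
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;             /* not a continuation */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    *cp = c;
    return len;
}
```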

At the parser level, though; not down in the lexer. I intern unique user-defined strings (just with a hashcons, or whatever the cool kids call it these days). That defers the determination of correctness of UTF-kness to "someone else".
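The interning itself can be dead simple; something like this (a sketch with a linear scan standing in for the hash table; names invented, `strdup` is POSIX):

```c
#include <string.h>
#include <stdlib.h>

typedef struct { char **items; size_t count, cap; } intern_table;

/* Return a canonical copy of s; equal strings always yield the same
 * pointer, so the parser can compare identifiers by pointer identity. */
const char *intern(intern_table *t, const char *s) {
    for (size_t i = 0; i < t->count; i++)
        if (strcmp(t->items[i], s) == 0)
            return t->items[i];
    if (t->count == t->cap) {
        size_t cap = t->cap ? t->cap * 2 : 16;
        char **p = realloc(t->items, cap * sizeof *p);
        if (!p) return NULL;
        t->items = p; t->cap = cap;
    }
    char *copy = strdup(s);
    if (!copy) return NULL;
    return t->items[t->count++] = copy;
}
```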