Yes. Examples of non-defects that should not be in the ERROR loglevel:
* Database timeout (the database is owned by a separate oncall rotation that has alerts when this happens)
* ISE in downstream service (return HTTP 5xx and increment a metric but don’t emit an error log)
* Network error
* Downstream service overloaded
* Invalid request
Basically, when you make a request to another service and get back a status code, your handler should look like:
    logfunc = logger.error if 400 <= status <= 499 and status != 429 else logger.warning
(Unless you have an SLO with the service about how often you’re allowed to hit it and they only send 429 when you’re over, which is how it’s supposed to work but sadly rare.)
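Fleshed out a little, it might look like the sketch below (assuming a requests-style session and a stdlib logger; the wrapper function is made up):

    import logging

    logger = logging.getLogger(__name__)

    def call_downstream(session, url, **kwargs):
        # Sketch only: `session` is assumed to be a requests-style client.
        resp = session.get(url, **kwargs)
        status = resp.status_code
        if 400 <= status <= 499 and status != 429:
            # A 4xx (other than 429) means we sent a bad request: our defect, so ERROR.
            logger.error("bad request to %s (status %s)", url, status)
        elif status >= 500 or status == 429:
            # Downstream failure or throttling: not our defect, so WARNING (plus a metric, not shown).
            logger.warning("downstream trouble from %s (status %s)", url, status)
        return resp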
> Database timeout (the database is owned by a separate oncall rotation that has alerts when this happens)
So people writing software are supposed to guess how your organization assigns responsibilities internally? And you're sure that the database timeout always happens because there's something wrong with the database, and never because something is wrong on your end?
No; I’m not understanding your point about guessing. Could you restate?
As for queries that time out: that should definitely be a metric, but it shouldn’t pollute the ERROR loglevel, especially if it’s something that happens at some noisy rate all the time.
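What I have in mind is roughly this sketch (the db handle, metrics client, helper, and metric name are all hypothetical):

    import logging

    logger = logging.getLogger(__name__)

    def run_query(db, metrics, sql, params=()):
        try:
            return db.execute(sql, params)
        except TimeoutError:
            # Count it so a dashboard or alert can watch the rate...
            metrics.increment("db.query.timeout")
            # ...but keep it out of ERROR; WARNING still records which query it was.
            logger.warning("query timed out: %s", sql)
            raise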
I think OP is making two separate but related points, a general point and a specific point. Both involve guessing something that the error-handling code, on the spot, might not know.
1. When I personally see database timeouts at work it's rarely the database's fault, 99 times out of 100 it's the caller's fault for their crappy query; they should have looked at the query plan before deploying it. How is the error-handling code supposed to know? I log timeouts (that still fail after retry) as errors so someone looks at it and we get a stack trace leading me to the bad query. The database itself tracks timeout metrics but the log is much more immediately useful: it takes me straight to the scene of the crime. I think this is OP's primary point: in some cases, investigation is required to determine whether it's your service's fault or not, and the error-handling code doesn't have the information to know that.
2. As with exceptions vs. return values in code, the low-level code often doesn't know how the higher-level caller will classify a particular error. A low-level error may or may not be a high-level error; the low-level code can't know that, but the low-level code is the one doing the logging. The low-level logging might even be a third party library. This is particularly tricky when code reuse enters the picture: the same error might be "page the on-call immediately" level for one consumer, but "ignore, this is expected" for another consumer.
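To make point 2 concrete, one way to leave the classification to the caller (a sketch; every name here is made up):

    import logging

    class UpstreamTimeout(Exception):
        """Low-level code reports what happened; it doesn't decide how bad it is."""

    def fetch_profile(client, user_id):
        # `client` is a hypothetical HTTP wrapper that raises TimeoutError on timeout.
        try:
            return client.get(f"/profiles/{user_id}")
        except TimeoutError as exc:
            raise UpstreamTimeout(f"profile fetch timed out for user {user_id}") from exc

    checkout_log = logging.getLogger("checkout")
    recs_log = logging.getLogger("recommendations")

    def render_checkout(client, user_id):
        try:
            return fetch_profile(client, user_id)
        except UpstreamTimeout:
            checkout_log.error("cannot render checkout without a profile")  # page-worthy for this caller
            raise

    def render_recommendations(client, user_id):
        try:
            return fetch_profile(client, user_id)
        except UpstreamTimeout:
            recs_log.warning("profile unavailable, using defaults")  # expected for this caller
            return None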
I think the more general point (that you should avoid logging errors for things that aren't your service's fault) stands. It's just tricky in some cases.
Also, everywhere I have worked there are transient network glitches from time to time. Timeouts can often be caused by these.
> the database is owned by a separate oncall rotation
Not OP, but this part hits the same for me.
In the case where your client app is killing the DB through too many calls (e.g. your cache is not working), you should be able to detect it and react without waiting for the DB team to come to you after they've investigated the whole thing.
But you can't know in advance whether the DB connection errors are your fault or not, so logging them to cover the worst-case scenario (you're the cause) is sensible.
I agree that you should detect this, just through a metric rather than putting DB timeouts in the ERROR loglevel.
But what's the base of your metric?
I feel you're thinking about system-wide downtime with everything timing out consistently, which would be detected by the generic database server vitals and basic logs.
But what if the timeouts are sparse, only 10 or 20% more than usual from the DB's POV, yet they affect half of your registration service's requests? You need it logged application-side so the aggregation layer has any chance of catching it.
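Counting both attempts and timeouts from the calling side gives the aggregation layer a base to compare against (sketch; the db handle, metrics client, and tag names are invented):

    def register_user(db, metrics, logger, payload):
        metrics.increment("registration.db_call", tags=["outcome:attempt"])
        try:
            return db.execute("INSERT INTO users ...", payload)
        except TimeoutError:
            # Counted per calling endpoint, so a 10-20% bump in timeouts that hits half of
            # registration traffic stands out even when DB-wide vitals barely move.
            metrics.increment("registration.db_call", tags=["outcome:timeout"])
            logger.warning("registration insert timed out")
            raise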
On writing to ERROR or not, the thresholds should be whatever your dev and oncall teams decide. Nobody outside of them will care; I feel it's like arguing about which drawer the socks should go in.
I was in an org where any single error below CRITICAL was ignored by the oncall team, and anything below CRITICAL only triggered alerts on aggregation or under special conditions. Pragmatically, we ended up slicing it as ERROR = goes to the APM, anything below = no aggregation, just available when a human wants to look at it for whatever reason. I'd expect most orgs to come up with that kind of split, where the levels are hooked to processes and not to some inherent meaning.
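In stdlib-logging terms the split looked roughly like this (the APM handler here is just a stand-in for whatever your vendor ships):

    import logging

    class ApmHandler(logging.Handler):
        """Stand-in for the handler an APM vendor would actually provide."""
        def emit(self, record):
            pass  # a real handler would forward the record to the APM

    root = logging.getLogger()
    root.setLevel(logging.DEBUG)       # keep everything around for when a human goes looking

    console = logging.StreamHandler()
    console.setLevel(logging.DEBUG)    # below ERROR: no aggregation, just available on demand
    root.addHandler(console)

    root.addHandler(ApmHandler(level=logging.ERROR))  # only ERROR and above reaches the APM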
> No; I’m not understanding your point about guessing. Could you restate?
In the general case, the person writing the software has no way of knowing that "the database is owned by a separate oncall rotation". That's about your organization chart.
Admittedly, they'd be justified in assuming that somebody is paying attention to the database. On the other hand, they really can't be sure that the database is going to report anything useful to anybody at all, or whether it's going to report the salient details. The database may not even know that the request was ever made. Maybe the requests are timing out because they never got there. And definitely maybe the requests are timing out because you're sending too many of them.
I mean, no, it doesn't make sense to log a million identical messages, but that's rate limiting. It's still an error if you can't access your database, and for all you know it's an error that your admin will have to fix.
As for metrics, I tend to see those as downstream of logs. You compute the metric by counting the log messages. And a metric can't say "this particular query failed". The ideal "database timeout" message should give the exact operation that timed out.
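Both of those fit in a few lines, as a sketch (the rate-limit numbers and the message shape are just illustrative):

    import logging
    import time

    logger = logging.getLogger(__name__)

    class RateLimitFilter(logging.Filter):
        """Allow at most `max_per_window` records per window; drop the rest."""
        def __init__(self, max_per_window=10, window_seconds=60):
            super().__init__()
            self.max_per_window = max_per_window
            self.window_seconds = window_seconds
            self.window_start = 0.0
            self.count = 0

        def filter(self, record):
            now = time.monotonic()
            if now - self.window_start > self.window_seconds:
                self.window_start, self.count = now, 0
            self.count += 1
            return self.count <= self.max_per_window

    logger.addFilter(RateLimitFilter())

    # The message names the exact operation, not just "database timeout":
    # logger.error("database timeout after %.1fs: %s params=%r", elapsed, sql, params)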
I wish I lived in a world where that worked. Instead, I live in a world where most downstream service issues (including database failures, network routing misconfigurations, giant cloud provider downtime, and more ordinary internal service downtime) are observed in the error logs of consuming services long before they’re detected by the owners of the downstream service … if they ever are.
My rough guess is that 75% of incidents on internal services were only reported by service consumers (humans posting in channels) across everywhere I’ve worked. Of the remaining 25% that were detected by monitoring, the vast majority were detected long after consumers started seeing errors.
All the RCAs and “add more monitoring” sprints in the world can’t add accountability equivalent to “customers start calling you/having tantrums on Twitter within 30sec of a GSO”, in other words.
The corollary is “internal databases/backend services can be more technically important to the proper functioning of your business, but frontends/edge APIs/consumers of those backend services are more visibly important to other people. As a result, edge services’ users often provide more valuable telemetry than backend monitoring.”
But everything you’re describing can be done with metrics and alerts; there’s no need to spam the ERROR loglevel.
My point is that just because those problems can be solved with better telemetry doesn’t mean that it’s actually done in practice. Most organizations are much more aware of/sensitive to failures upstream/at the edge than they are to failures in backend services. Once you account for alert fatigue, crappy accountability distribution, and organizational pressures, even the places that do this well often backslide over time.
In brief: drivers don’t obey the speed limit and backend service operators don’t prioritize monitoring. Both groups are supposed to do those things, but they don’t and we should assume they won’t change. As a result, it’s a good idea to wear seatbelts and treat downstream failures as urgent errors in the logs of consuming services.
4xx is for invalid requests. You wouldn't log a 404 as an error.
I’m talking about codes you receive from services you call out to.
What if a user sends some sort of auth token or other data that you yourself can't validate, and the third party gives you a 4xx for it? You won't know ahead of time whether that token or data is valid; you only find out after making a request to the third party.
Oh that makes sense.
There are still some special cases, because 404 is used for both “There’s no endpoint with that name” and “There’s no record with the ID you tried to look up.”
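When I’ve had to disambiguate, the best I’ve managed is to lean on the downstream’s error body, if it has one (a sketch; the RECORD_NOT_FOUND code and the response shape are made up, and plenty of services give you nothing to go on):

    import logging

    logger = logging.getLogger(__name__)

    def handle_downstream_404(resp, looked_up_id):
        # Sketch: assumes a requests-style response whose body may carry a machine-readable code.
        try:
            body = resp.json()
        except ValueError:
            body = {}
        code = body.get("error_code") if isinstance(body, dict) else None

        if code == "RECORD_NOT_FOUND":
            # The ID we were asked about doesn't exist: not our defect.
            logger.warning("record %s not found downstream", looked_up_id)
        else:
            # Could be a wrong or renamed endpoint on our side: worth an ERROR and a look.
            logger.error("unexpected 404 from downstream: %s", resp.url)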