I think it's difficult to say without knowing how the system is deployed and administered. "If a SMTP mailer trying to send email to somewhere logs 'cannot contact port 25 on <remote host>', that is not an error in the local system"
Maybe or maybe not. If the connection problem is really due to the remote host then that's not the problem of the sender. But maybe the local network interface is down, maybe there's a local firewall rule blocking it,...
If you know the deployment scenario then you can make reasonable decisions on logging levels, but quite often code is generic and can be deployed in multiple configurations, so that's hard to do.
The point is that if your program itself takes note of the error from the library, that is OK. You, as the program owner, can decide what to do with it (log it as an error or not).
But if you are the SMTP library and you unilaterally log that as an error, that is an issue.
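A minimal sketch of that division of labor, in Python. `DeliveryError`, `send_mail` and `deliver` are hypothetical names made up for illustration, not any real SMTP library's API. The library-side function only reports the failure to its caller; the application decides the severity.

```python
import logging
from typing import Optional

log = logging.getLogger("app")


class DeliveryError(Exception):
    """Hypothetical library exception: the remote port could not be reached."""

    def __init__(self, host: str, port: int):
        super().__init__(f"cannot contact port {port} on {host}")
        self.host = host
        self.port = port


def send_mail(host: str, reachable: bool) -> None:
    """Stand-in for a library call. It raises instead of logging an error."""
    if not reachable:
        raise DeliveryError(host, 25)


def deliver(host: str, reachable: bool, host_is_ours: bool) -> Optional[int]:
    """Application code: returns the level it logged at, or None on success."""
    try:
        send_mail(host, reachable)
    except DeliveryError as exc:
        # Assumption for illustration: failing to reach our own relay is our
        # problem; a third-party host being unreachable may not be.
        level = logging.ERROR if host_is_ours else logging.WARNING
        log.log(level, "delivery failed: %s", exc)
        return level
    return None
```

The same failure ends up as a WARNING or an ERROR depending on what the caller knows about the deployment, which is knowledge the library does not have.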
This would require a completely new ecosystem, and likely a new language, where any degradation of code flow becomes communicable in a standardized and fully documented fashion.
The closest we have is something like Java with exceptions in type signatures, but we would have to ban any kind of exception capture except in final programs, and promote basically any logger call into an exception that you could remotely suppress.
We could philosophize about a world with compilers made out of unobtanium, but in this reality a library author cannot know which conditions are fixable or necessitate a fix. And structured logging has way too many deficiencies to make it work from that angle.
The counterpoint made above is that, while what you describe is indeed the way the author likes to see it, that doesn't explain why "an error is something which failed that the program was unable to fix automatically" is supposed to be any less valid a way to see it. I.e., should an error be defined as "the program was unable to complete the task you told it to do" or only as "things which could have worked but need you to explicitly change something locally"?
I don't even know how to say whether these definitions are right or wrong; it's just whatever you feel it should be. The most important thing is that what your program logs is documented somewhere, and the next most important thing is that your log levels are self-consistent and follow some sort of logic. Whether I would have done it exactly the same way is not really important.
At the end of the day, this is just bikeshedding about how to collapse ultra specific alerting levels into a few generic ones. E.g. RFC 5424 defines 8 separate log levels for syslog and, while that's not a ceiling by any means, it's easy to see how there's already not really going to be a universally agreed way to collapse even just these down to 4 categories.
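To make the disagreement concrete, here is a small sketch. The eight severities and their numbers are the ones from RFC 5424; the two `collapse_*` functions are made-up examples of equally defensible ways to squash them into four buckets.

```python
from enum import IntEnum


class Severity(IntEnum):
    """The eight severities from RFC 5424, numbered as in the RFC."""
    EMERGENCY = 0
    ALERT = 1
    CRITICAL = 2
    ERROR = 3
    WARNING = 4
    NOTICE = 5
    INFORMATIONAL = 6
    DEBUG = 7


def collapse_a(sev: Severity) -> str:
    """One arbitrary choice: NOTICE counts as a warning."""
    if sev <= Severity.ERROR:
        return "error"
    if sev <= Severity.NOTICE:
        return "warning"
    if sev == Severity.INFORMATIONAL:
        return "info"
    return "debug"


def collapse_b(sev: Severity) -> str:
    """Another arbitrary choice: NOTICE counts as info."""
    if sev <= Severity.ERROR:
        return "error"
    if sev == Severity.WARNING:
        return "warning"
    if sev <= Severity.INFORMATIONAL:
        return "info"
    return "debug"


# The severities on which the two schemes disagree.
disputed = [s.name for s in Severity if collapse_a(s) != collapse_b(s)]
```

Neither mapping is wrong; they just need to be documented and applied consistently.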
Any robust system isn’t going to rely on reading logs to figure out what to do about undelivered email anyway. If you’re doing logistics the failure to send an order confirmation needs to show up in your data model in some manner. Managing your application or business by logs is amateur hour.
There’s a whole industry of “we’ll manage them for you” which is just enabling dysfunction.
> But maybe the local network interface is down, maybe there's a local firewall rule blocking it,...
That's exactly why you log it as a warning. People get warned all the time about the dangers of smoking. It's important that people be warned about smoking; these warnings save lives. People should pay attention to warnings, which let them know about worrisome concerns that should be heeded. But guess what? Everyone has a story about someone who smoked until they were 90 and died in a car accident. It is not an error that somebody is smoking. Other systems will make their own bloody decisions and firewalling you off might be one of them. That is normal.
What do you think a warning means?
How about this:
- An error is an event that someone should act on. Not necessarily you. But if it's not an event that ever needs the attention of a person then the severity is less than an error.
Examples: invalid credentials, HTTP 404 Not Found, HTTP 403 Forbidden (all of the HTTP 4xx codes, by definition).
It's not my problem as a site owner if one of my users entered the wrong URL or typed their password wrong, but it's somebody's problem.
A warning is something that A) a person would likely want to know and B) wouldn't necessarily need to act on
INFO is for something a person would likely want to know and that is unlikely to need action
DEBUG is for something likely to be helpful
TRACE is for just about anything that happens
EMERG/CRIT are for significant errors of immediate impact
PANIC: the sky is falling, I hope you have good running shoes
If you're logging and reporting on ERRORs for 400s, then your error triage log is going to be full of things like a user entering a password with insufficient complexity or trying to sign up with an email address that already exists in your system.
Some of these things can be ameliorated with well-behaved UI code, but a lot cannot, and if your primary product is the API, then you're just going to have scads of ERRORs to triage where there's literally nothing you can do.
I'd argue that anything that starts with a 4 is an INFO, and if you really wanted to be thorough, you could set up an alert on the frequency of these errors to help you identify if there's a broad problem.
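A rough sketch of that idea, assuming a simple sliding window over recent responses. `level_for_status` and `ClientErrorRate` are hypothetical names, and the window size and threshold are placeholders, not recommendations.

```python
import logging
from collections import deque


def level_for_status(status: int) -> int:
    """Map an HTTP status code to a log level, following the suggestion above."""
    if 400 <= status < 500:
        return logging.INFO   # client-side problem, nothing to triage
    if status >= 500:
        return logging.ERROR  # server-side problem
    return logging.DEBUG


class ClientErrorRate:
    """Tracks the share of 4xx responses over the last `window` requests."""

    def __init__(self, window: int = 100, threshold: float = 0.5):
        self.recent = deque(maxlen=window)  # True where the response was a 4xx
        self.threshold = threshold

    def record(self, status: int) -> bool:
        """Record one response; return True if an alert should fire."""
        self.recent.append(400 <= status < 500)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data yet to call anything anomalous
        return sum(self.recent) / len(self.recent) > self.threshold
```

Individual 4xx responses stay at INFO; only the aggregate rate can trigger an alert.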
The frequency is important and so is the answer to "could we have done something different ourselves to make the request work". For example in credit card processing, if the remote network declines, then at first it seems like not your problem. But then it turns out for many BINs there are multiple choices for processing and you could add dynamic routing when one back end starts declining more than normal. Not a 5xx and not a fault in your process, but a chance to make your customer experience better.
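As a sketch of what such routing could look like: the `Backend` class, its fields and the `tolerance` value are all invented for illustration, and a real processor integration would be considerably more involved.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Backend:
    """Hypothetical processing back end with a baseline and a recent decline rate."""
    name: str
    baseline_decline_rate: float
    recent_decline_rate: float

    def excess(self) -> float:
        """How far recent declines are above what is normal for this back end."""
        return self.recent_decline_rate - self.baseline_decline_rate


def pick_backend(backends: List[Backend], tolerance: float = 0.05) -> Backend:
    """Prefer the first back end unless it is declining more than normal."""
    primary = backends[0]
    if primary.excess() <= tolerance:
        return primary
    # Primary is degraded: use whichever back end is closest to its own normal.
    return min(backends, key=Backend.excess)
```

None of the declines is an error in your own process, yet tracking their frequency is what makes the rerouting possible.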
You already have HTTP logs tracked, so you don't need to report these twice, once in the HTTP log and once on the backend. You're effectively raising the error to the HTTP server, and its logs are where the errors live. You don't alert on single HTTP 4xx errors because nobody does; you only alert on anomalous numbers of HTTP 4xx errors. You do alert on HTTP 5xx errors because, as "Internal" HTTP errors, those are always on you.
In other words, of course you don't alert on errors which are likely somebody else's problem. You put them in the log stream where that makes sense and can be treated accordingly.
> An error is an event that someone should act on. Not necessarily you.
Personally, I'd further qualify that. It should be logged as an error if the person who reads the logs would be responsible for fixing it.
Suppose you run a photo gallery web site. If a user uploads a corrupt JPEG, and the server detects that it's corrupt and rejects it, then someone needs to do something, but from the point of view of the person who runs the web site, the web site behaved correctly. It can't control whether people's JPEGs are corrupt. So this shouldn't be categorized as an error in the server logs.
But if you let users upload a batch of JPEG files (say a ZIP file full of them), you might produce a log file for the user to view. And in that log file, it's appropriate to categorize it as an error.
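Something like the following, as a sketch. `looks_like_jpeg` is a deliberately simplistic stand-in for a real decoder (it only checks the JPEG start and end markers), and `process_batch` and the logger name are hypothetical.

```python
import logging

server_log = logging.getLogger("gallery.server")


def looks_like_jpeg(data: bytes) -> bool:
    """Simplistic check: JPEG start marker (FF D8) and end marker (FF D9) only."""
    return len(data) >= 4 and data[:2] == b"\xff\xd8" and data[-2:] == b"\xff\xd9"


def process_batch(files: dict) -> list:
    """Return the user-facing report lines for a batch upload."""
    report = []
    for name, data in files.items():
        if looks_like_jpeg(data):
            report.append(f"OK    {name}")
        else:
            # For the uploader this is an error: their file was not accepted.
            report.append(f"ERROR {name}: not a valid JPEG")
            # For the site operator the server behaved correctly, so no ERROR here.
            server_log.info("rejected upload %s: not a valid JPEG", name)
    return report
```

The same event gets two severities because the two logs have two different readers.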
Counterargument: how do you know the user uploaded a corrupted image and it didn't get corrupted by your internet connection, server hardware, or a bug in your software stack?
You cannot accurately assign responsibility until you understand the problem.
This is just trolling. The JPEG is corrupt if the library that reads it says it is corrupt. You log it as a warning. If you upgrade the library or change your upstream reverse proxy and start getting 1000x the number of warnings, you can still recognize that and take action without personally inspecting each failed upload to be sure you haven't yet stumbled on the one edge case where the JPEG library is out of spec.
That's the difference between an HTTP 4xx and 5xx
4xx is for client side errors, 5xx is for server side errors.
For your situation you'd respond with an HTTP 400 "Bad Request" and not an HTTP 500 "Internal Server Error" because the problem was with the request not with the server.
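In code, the split might look roughly like this. `CorruptImage`, `handle_upload` and the two toy decoders are hypothetical; the only assumption is that the decoder raises a distinguishable exception when the input itself is bad.

```python
from typing import Callable, Tuple


class CorruptImage(ValueError):
    """Hypothetical exception a decoder raises when the uploaded bytes are invalid."""


def handle_upload(data: bytes, decode: Callable[[bytes], object]) -> Tuple[int, str]:
    """Return (status, reason), separating client faults from server faults."""
    try:
        decode(data)
    except CorruptImage:
        return 400, "Bad Request"            # the request was the problem
    except Exception:
        return 500, "Internal Server Error"  # our code or infrastructure failed
    return 201, "Created"


def strict_decode(data: bytes) -> bytes:
    """Toy decoder: rejects anything without the JPEG start marker."""
    if not data.startswith(b"\xff\xd8"):
        raise CorruptImage("missing JPEG start marker")
    return data


def broken_decode(data: bytes) -> bytes:
    """Toy decoder that simulates a server-side failure."""
    raise RuntimeError("disk full")
```

Anything the decoder rejects maps to a 4xx; anything that blows up unexpectedly maps to a 5xx and is the one you get paged for.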