Closed Bug 1144521 Opened 10 years ago Closed 9 years ago

taskcluster-base: Implement [alert-operator] logging function

Categories

(Taskcluster :: Services, defect)

Platform: All, Linux
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jonasfj, Unassigned)

Details

For better or worse, we've come up with the semi-crazy convention that log lines containing the string "[alert-operator]" are something we should monitor. It's a very easy way of setting up log alerts without having to implement email logic in every component (just another thing that can fail). Papertrail also limits the number of alert emails it sends.

Anyway, doing this with:

    debug("[alert-operator] Error in %s, because of: %j", where, obj);

is bad because:
- We can misspell "[alert-operator]".
- The debug module can be silenced with the DEBUG environment variable.
- A dedicated function would let us take other actions if we wanted, like implementing custom alert logic.

So we should implement:

    base.log.alert("Error in %s, because of: %j", where, obj);

For now we just implement it in taskcluster-base, using:

    console.error("[alert-operator] " + msg, ...args);

In the future we may want different error logging levels. Using debug() for informational logs is great, but it's probably bad for error logs because it's so easy to silence.
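A minimal sketch of the proposed alert function, assuming Node's built-in `util.format` to handle the same `%s`/`%j` placeholders that `debug()` accepts. `createLog` and its injectable `write` parameter are hypothetical names for illustration, not taskcluster-base's actual API; the default writer uses `console.error`, as suggested above, so the DEBUG variable cannot silence it.

```javascript
const util = require("util");

// The tag the log-alert rules match on, defined once so call sites cannot misspell it.
const ALERT_TAG = "[alert-operator]";

// Hypothetical factory: `write` defaults to console.error (not debug()),
// and can be swapped out, e.g. for tests or for custom alert logic later.
function createLog(write = (line) => console.error(line)) {
  return {
    // Formats the message like debug()/console.error (%s, %j, ...) and prefixes the tag.
    alert(msg, ...args) {
      write(`${ALERT_TAG} ${util.format(msg, ...args)}`);
    },
  };
}

// Usage: the call site never spells out the tag itself.
const log = createLog();
log.alert("Error in %s, because of: %j", "claimWork", { status: 500 });
// stderr: [alert-operator] Error in claimWork, because of: {"status":500}
```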
See Also: → 1156320
Yeah, I'm definitely a fan of using whatever we can to log things this way:

    log.critical('page jonas')
    log.warn('message')
    log.info('message')

This keeps logging consistent across projects that use the same logger, uses severity levels you'll find in other loggers, and means we don't have to rely on adding a tag that could be misspelled (I've already made the mistake of leaving the hyphen out of alert-operator).
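A minimal sketch of severity-named logging methods, assuming a single ordered level table and a minimum-level threshold. `createLogger`, `minLevel`, and the `debug`/`error` levels are hypothetical additions for illustration; only `critical`, `warn`, and `info` come from the comment above.

```javascript
// Severity levels in increasing order; debug and error are assumed extras.
const LEVELS = ["debug", "info", "warn", "error", "critical"];

// Hypothetical factory: one method per level is generated from the table,
// so every project using this logger spells the level names identically.
function createLogger({ minLevel = "info", write = (line) => console.log(line) } = {}) {
  const threshold = LEVELS.indexOf(minLevel);
  if (threshold === -1) throw new Error(`unknown level: ${minLevel}`);
  const logger = {};
  LEVELS.forEach((level, rank) => {
    logger[level] = (msg) => {
      if (rank < threshold) return; // below the configured severity: drop it
      write(`${level.toUpperCase()} ${msg}`);
    };
  });
  return logger;
}

const log = createLogger({ minLevel: "warn" });
log.info("message");        // dropped: info is below warn
log.warn("message");        // prints "WARN message"
log.critical("page jonas"); // prints "CRITICAL page jonas"
```

A misspelled method name such as `log.critcal(...)` fails loudly with a TypeError, unlike a misspelled tag inside a string.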
I also want something that has different handlers depending on the severity of the event: for example, log and debug messages go to a local file that is backed up to S3 at some point, while critical and warning events go to Papertrail.
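A sketch of severity-based routing, assuming each handler declares the set of levels it accepts. The two sinks are in-memory stand-ins, not real S3 or Papertrail clients, treating "log" as the info level is an assumption, and `createRoutedLogger` is a hypothetical name.

```javascript
// Hypothetical sketch: every event is offered to each route whose
// `levels` list includes the event's level.
function createRoutedLogger(routes) {
  const logger = {};
  for (const level of ["debug", "info", "warn", "critical"]) {
    logger[level] = (msg) => {
      const event = { level, msg, when: new Date().toISOString() };
      for (const route of routes) {
        if (route.levels.includes(level)) route.handler(event);
      }
    };
  }
  return logger;
}

const fileSink = [];   // stand-in for a local file later backed up to S3
const remoteSink = []; // stand-in for forwarding to Papertrail

const log = createRoutedLogger([
  { levels: ["debug", "info"], handler: (e) => fileSink.push(e) },
  { levels: ["warn", "critical"], handler: (e) => remoteSink.push(e) },
]);

log.info("worker started");      // goes to fileSink only
log.critical("queue unreachable"); // goes to remoteSink only
```

Because routes are checked independently, an event can go to several handlers; listing "critical" in both routes would also keep critical events in the local file.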
I am a fan of structured logs. What I'd really like is to be able to log like this:

    log.error('Found a problem', {
      workerTypeName: x,
      spotBid: y,
      minBid: z,
    });

so that we can later process it into a human-readable format while keeping it machine-parseable. Internally, we'd take the message and things like __filename and Date.now(), put them into keys, and log the messages as a stream of JSON objects.

Every time I've seen logging that isn't structured and relies on string formatting for later understanding, the logs have been next to useless. It's far easier to produce human-readable logs from structured ones than the other way around.

With structured logs, we could do fun things like defining unique error codes and ensuring that every log message has:

    { level: x, when: y, code: z }

which would make it trivial to alert when we hit something consistently rather than once or twice a day. We could also track trends and build real dashboards showing the error rate of each type, and all that jazz. Logs are great for post-failure dissection, but with structured logging we can build tools that help us spot problems before they turn into disasters.

I've played with bunyan logging a little for Node and I liked it. Ideally, whatever we go with should also be usable from other languages.
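A hand-rolled sketch of the structured logging described above, not bunyan's implementation: each entry is one JSON object per line with `level`, `when`, `msg`, and `source` keys plus caller-supplied fields such as an error `code`. `createStructuredLogger`, `toHuman`, and the injectable `now` are hypothetical names; in real use `source` could be `__filename`.

```javascript
// Hypothetical structured logger: writes one JSON line per entry.
function createStructuredLogger({ source, write = (line) => console.log(line), now = Date.now } = {}) {
  const entry = (level) => (msg, fields = {}) => {
    const record = { level, when: now(), msg, source };
    // Caller fields are appended but may not overwrite the fixed keys.
    for (const [key, value] of Object.entries(fields)) {
      if (!Object.prototype.hasOwnProperty.call(record, key)) record[key] = value;
    }
    write(JSON.stringify(record));
  };
  return { info: entry("info"), warn: entry("warn"), error: entry("error") };
}

// Human-readable output is derived from the machine-readable line, not the reverse.
function toHuman(line) {
  const { level, when, msg, source, ...fields } = JSON.parse(line);
  const extras = Object.entries(fields)
    .map(([key, value]) => `${key}=${JSON.stringify(value)}`)
    .join(" ");
  return `${new Date(when).toISOString()} ${level.toUpperCase()} [${source}] ${msg} ${extras}`.trim();
}

const log = createStructuredLogger({ source: "provisioner.js" });
log.error("Found a problem", { code: "SPOT_BID_TOO_LOW", workerTypeName: "example", spotBid: 0.1, minBid: 0.2 });
```

The error codes and field values here are made up for illustration; a fixed `code` key is what would let a later tool count occurrences per code for alerting or dashboards.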
bunyan + loggly does look interesting. I just noticed that loggly pricing is competitive again :)

I'm not sure we should drop usage of "debug", but perhaps "debug" should be used only for debug messages: not for things we want to log in production, only for logging during development. It's certainly more work to pass around a non-global reference to a log object.

---

So my initial bug was just about a simple hack. But if someone is willing to do some research, outline the options in an etherpad, and present libraries to sell us on, then we could hold a meeting and all agree on one thing. I know jlal has been keen on loggly in combination with structured logs before.
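To illustrate the cost of passing a non-global log object around, here is a hypothetical sketch in which a component receives its logger as a constructor argument and derives a child that stamps its own fields on every entry. bunyan offers a similar `log.child()`, but this hand-rolled `createLogger`/`child` pair and the `Provisioner` class are illustrative only, not bunyan's implementation.

```javascript
// Hypothetical logger: `bound` fields are added to every record it writes.
function createLogger(bound = {}, write = (record) => console.log(JSON.stringify(record))) {
  return {
    info: (msg, fields = {}) => write({ ...bound, ...fields, level: "info", msg }),
    error: (msg, fields = {}) => write({ ...bound, ...fields, level: "error", msg }),
    // A child shares the parent's output but adds its own bound fields;
    // the parent is not modified.
    child: (extra) => createLogger({ ...bound, ...extra }, write),
  };
}

// A component that takes its logger explicitly instead of using a global.
class Provisioner {
  constructor({ log }) {
    this.log = log.child({ component: "provisioner" });
  }
  run() {
    this.log.info("starting iteration");
  }
}

const root = createLogger({ service: "example-service" });
new Provisioner({ log: root }).run();
// prints {"service":"example-service","component":"provisioner","level":"info","msg":"starting iteration"}
```

The extra work is the constructor parameter threaded through each component; in return, every entry says which component produced it without any string tagging.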
Component: TaskCluster → General
Product: Testing → Taskcluster
Component: General → Platform Libraries
Component: Platform Libraries → Platform and Services
Let's not do this... We have Sentry now; it's much better to send errors to Sentry :)
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Component: Platform and Services → Services