Closed Bug 1452002 Opened 7 years ago Closed 6 years ago

Figure out what logging looks like in a redeployability world

Categories

(Taskcluster :: Services, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bstack, Assigned: bstack)

References

Details

Probably something to do with aggregating logs out of kubernetes and also figuring out an easy way to make something work for workers as well!
I believe pmoore was going to look at this, per the Berlin meeting.
Assignee: nobody → pmoore
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #2)
> https://cloud.google.com/logging/

That looks neat!

I had a meeting with :jonasfj and :bstack where I believe we decided:

1) The only requirement for an application should be that it logs to a file *somewhere*. The format of that file is the responsibility of the component, since typically there will be third party apps or the OS producing logs that we cannot influence but still wish to aggregate.

2) A logging solution needs to be able to feed multiple log files per system to a log aggregator.

3) It should be possible to deploy taskcluster without a built-in log aggregation service, for users that wish to integrate with their own existing log aggregation service.

I recall we discussed the possibility of having structured logs, whereby non-taskcluster systems could log raw data that would be trivially transposed into structured form. I think the idea was that the log forwarding process would take care of performing this translation on the fly for raw unstructured logs, and for services that already support structured logging, no transposing would be required.

Brian, Jonas, is this how you remember our conversation too?

I'll follow up with jhford to get his input too, as he has been pivotal in implementing structured logging in our core services.
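(A rough Node-flavoured sketch of the "transpose on the fly" idea described above: a forwarder that reads a component's plain-text log file and wraps each line in a minimal structured envelope. The file path, the field names, and the use of stdout as a stand-in aggregator are all invented for illustration; this is not an existing taskcluster component.)

    // Wrap raw, unstructured log lines in a structured envelope.
    // A real forwarder would follow the file as it grows and ship the
    // records to whatever aggregator the deployment uses; this sketch
    // just reads what is there and prints the result to stdout.
    const fs = require('fs');
    const os = require('os');
    const readline = require('readline');

    const SOURCE = process.argv[2] || '/var/log/some-component.log'; // illustrative path

    async function forward() {
      const rl = readline.createInterface({
        input: fs.createReadStream(SOURCE),
        crlfDelay: Infinity,
      });
      for await (const line of rl) {
        // Services that already emit structured (JSON) lines could be
        // passed through untouched instead of being wrapped.
        const record = {
          source: SOURCE,
          host: os.hostname(),
          time: new Date().toISOString(),
          message: line,
        };
        process.stdout.write(JSON.stringify(record) + '\n');
      }
    }

    forward().catch(err => {
      console.error('forwarder failed:', err);
      process.exit(1);
    });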
Flags: needinfo?(jopsen)
Flags: needinfo?(bstack)
Summary: Figure out what logging looks like in a r14y world → Figure out what logging looks like in a redeployability world
(In reply to Pete Moore [:pmoore][:pete] from comment #3)
> (In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #2)
> > https://cloud.google.com/logging/
>
> That looks neat!

++ I agree

> I had a meeting with :jonasfj and :bstack where I believe we decided:
>
> 1) The only requirement for an application should be that it logs to a file
> *somewhere*. The format of that file is the responsibility of the component,
> since typically there will be third party apps or the OS producing logs
> that we cannot influence but still wish to aggregate.

Would this file be allowed to be stdout? k8s sorta-kinda handles logs at that level just fine for now. What do we mean by "component" in this context?

> 2) A logging solution needs to be able to feed multiple log files per
> system to a log aggregator.

Yes, this would be good to have eventually as well, both for structured logging purposes and to get feeds from different systems (maybe I should call this "component"?).

> 3) It should be possible to deploy taskcluster without a built-in log
> aggregation service, for users that wish to integrate with their own
> existing log aggregation service.

Yep! Making it as easy as possible to use your own aggregator is probably best. k8s may provide some abstractions here.

> I recall we discussed the possibility of having structured logs, whereby
> non-taskcluster systems could log raw data that would be trivially
> transposed into structured form. I think the idea was that the log
> forwarding process would take care of performing this translation on the fly
> for raw unstructured logs, and for services that already support structured
> logging, no transposing would be required.

Yeah. I don't know that this needs to happen in order for redeployment to happen, though. Probably a separate-but-related issue?

> Brian, Jonas, is this how you remember our conversation too?

I purge all conversations from my brain every 3 days! However, yes, this is basically what I remember too.

> I'll follow up with jhford to get his input too, as he has been pivotal in
> implementing structured logging in our core services.

++ Yes, this is definitely something jhford should be involved in.
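(For the "just log to stdout" case above, the per-service side can stay very small: emit one JSON object per line on stdout and let Kubernetes, or whatever runs the container, capture it. A minimal sketch with illustrative field names:)

    // One JSON object per line on stdout; `kubectl logs <pod>` (or any
    // aggregator the deployer wires up) sees these lines unchanged.
    function log(level, message, extra = {}) {
      process.stdout.write(JSON.stringify({
        time: new Date().toISOString(),
        level,
        message,
        ...extra,
      }) + '\n');
    }

    log('info', 'service started', { service: 'some-taskcluster-service' });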
Flags: needinfo?(bstack)
Hey John,

Based on the work you've done with structured logging in taskcluster, do you have input/requirements/wishes to feed into this?

Thanks.
Flags: needinfo?(jhford)
My concerns regarding logging were mostly that:

- the creation of log destinations should be programmatic
- log destinations should be able to expire
- log destinations should be authenticated
- log destinations should have server-side tagging

These are mostly administration and security concerns. They ought to be obvious, but I haven't seen many logging systems that don't simply run on best intentions and network security.
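(To make those four points concrete, here is a toy in-memory sketch of what programmatic, expiring, authenticated, server-tagged log destinations could look like. None of these names correspond to an existing taskcluster API; the code only illustrates the requirements listed above.)

    const crypto = require('crypto');

    class DestinationRegistry {
      constructor() {
        this.destinations = new Map();
      }

      // programmatic creation, with an expiry and server-side tags
      create({ name, ttlMs, serverTags }) {
        const token = crypto.randomBytes(16).toString('hex');
        this.destinations.set(token, {
          name,
          serverTags,
          expires: Date.now() + ttlMs,
          records: [],
        });
        return token; // the only credential that allows writing to this destination
      }

      write(token, record) {
        const dest = this.destinations.get(token);
        if (!dest) throw new Error('unknown destination (bad token)');
        if (Date.now() > dest.expires) throw new Error('destination expired');
        // server-side tagging: these fields always win over whatever the client sends
        dest.records.push({ ...record, ...dest.serverTags });
      }
    }

    // usage
    const registry = new DestinationRegistry();
    const token = registry.create({
      name: 'worker-logs',
      ttlMs: 24 * 60 * 60 * 1000,
      serverTags: { source: 'worker', cluster: 'example' },
    });
    registry.write(token, { level: 'info', message: 'hello' });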
Flags: needinfo?(jopsen)
I like the idea of being able to interleave system messages into the logs. I suspect that the best approach would be something where we have an aggregator on each system which takes in logs from syslog, dmesg, and journalctl as well as our services and ensures they're in a machine-parsable structure. For system-level logging, those would probably just be a simple wrapper and a tag for the source (e.g. source: syslog, source: dmesg). We could then have a router which receives messages from the on-system aggregator and directs them where needed, based on tags and configuration. This is where things like audit-trail messages would be picked out and sent to the appropriate system.

It would be ideal if we could have log levels for each of the sources, which may be possible for things like journalctl logs and definitely for our services. If we have at least basic structure and log levels, our aggregator can do smart things like putting everything from INFO-FATAL into a log file that lives for 5 days, everything from ERROR-FATAL for 30 days, and everything which is an audit log for eternity. That way, we can have really verbose logs for the last few days without storing the entire world. If we were to pick a storage provider in the same DC for these short-lived logs, we'd have minimal overhead.

I think we need to be really careful about requiring authentication for logging... Sometimes we have systems which we care about logs from before they have credentials. Maybe we could add a tag like "unauthenticatedLog: true" and keep those messages for a shorter amount of time if that client is not later authenticated. This would be for times like booting an instance, where we need certain startup code to get credentials set up before we can do things. If we're having trouble with that code, or maybe a kernel panic shortly after the networking stack and our logging aggregator go live, we definitely want logs. Maybe each logger generates a uuid at start as an identifier for that instance of a log source, and when we authenticate, that uuid is provided during authentication and all messages from that log source id are upgraded from unauthenticated to authenticated. Basically, logging should be for finding errors, and the more machinery we add between starting the machine and getting logs, the greater the likelihood of logging not helping us find errors.

One of my biggest complaints with our papertrail account is that it's hard to find what I'm looking for. It would be lovely if each message was tagged with something like this at start:

    {
      loggerId: uuid.v4(),
      codebase: "https://github.com/taskcluster/super-service-3000",
      codebaseRev: "abcde123",
      category: "taskcluster-service",
      name: "super-service-3000-us-east-1"
    }

And then following messages would include the loggerId, so that we could do things like trace down which exact changeset a given log message was from, given the loggerId and our ability to search for this initialization message.
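(A small sketch of that initialization-message idea, reusing the field names from the example above; the createLogger wrapper and the JSON-lines-to-stdout wiring are invented for illustration.)

    const { randomUUID } = require('crypto'); // stand-in for uuid.v4()

    function createLogger({ codebase, codebaseRev, category, name }) {
      const loggerId = randomUUID();
      const emit = record => process.stdout.write(JSON.stringify(record) + '\n');

      // one-time initialization record; later messages only carry loggerId,
      // so searching for this record recovers the exact codebase/changeset
      // that a given log message came from
      emit({ type: 'logger-start', loggerId, codebase, codebaseRev, category, name });

      return (level, message, fields = {}) =>
        emit({ loggerId, time: new Date().toISOString(), level, message, ...fields });
    }

    // usage
    const log = createLogger({
      codebase: 'https://github.com/taskcluster/super-service-3000',
      codebaseRev: 'abcde123',
      category: 'taskcluster-service',
      name: 'super-service-3000-us-east-1',
    });
    log('info', 'service started');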
Flags: needinfo?(jhford)
I think we need to aim for simple-and-quick here. Probably that's just configuring our deployment so that all service stdout gets dumped to a single location via syslog -- something that a user can then direct to papertrail or loggly or whatever they want.

For the moment, note that worker deployments are not automated, so we don't need to worry about them.

If we had the engineering bandwidth to do fancier things right now, I'd be all for it -- but we don't. If we can invent a solution that makes room for such fancier things later, without much additional effort right now, that would be ideal.

Jonas, is this something you could take on?
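(As a sketch of how small the "stdout to syslog" piece could be: read a service's stdout from stdin and forward each line as a syslog-style UDP message to a configurable endpoint, which could be papertrail, loggly's syslog input, or anything else. The host, port, and framing details here are placeholders, not a worked-out design.)

    const dgram = require('dgram');
    const os = require('os');
    const readline = require('readline');

    const HOST = process.env.SYSLOG_HOST || 'logs.example.com'; // placeholder endpoint
    const PORT = Number(process.env.SYSLOG_PORT || 514);
    const socket = dgram.createSocket('udp4');

    const rl = readline.createInterface({ input: process.stdin });
    rl.on('line', line => {
      // <134> = facility local0, severity info; minimal RFC 3164-style framing
      const msg = `<134>${new Date().toISOString()} ${os.hostname()} taskcluster: ${line}`;
      socket.send(msg, PORT, HOST);
    });
    rl.on('close', () => socket.close());

Usage would be along the lines of: node some-service.js | node forward-to-syslog.js, with the environment variables pointing at whichever syslog endpoint the deployer chooses.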
Flags: needinfo?(jopsen)
Flags: needinfo?(jopsen)
Assignee: pmoore → bstack
Status: NEW → ASSIGNED
Component: Redeployability → Services
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED