Open Bug 1436274 Opened 7 years ago Updated 4 years ago

write generic-worker log output to event log (rather than a log file)

Categories

(Taskcluster :: Workers, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

People

(Reporter: grenade, Unassigned)

References

(Blocks 1 open bug)

Details

on windows, the event log is designed to handle high volume logging in as close to real time as possible. currently generic worker logs to a file rather than to the system event log. this adds overhead for nxlog (the log aggregation worker on instances) which needs to parse the application log file looking for new lines before forwarding to papertrail. the effect of the current mechanism is that output from generic worker is logged in papertrail with an incorrect event time and is displayed out of step from all other application logs on the instance. it makes it difficult to debug or turns debugging into guesswork because it is not known when output from generic worker was logged in relation to other events which occured on the system. if generic worker were modified to send log data to the event log (as is good practice for windows applications), the log timestamps in papertrail would be accurate and would shed light on some of the more difficult to debug issues we have on some windows instances (eg: win 10 instances often go into an impaired state while running gpu tests but we need accurate gw timestamp logs to determine exactly when the instance becomes impaired, bug 1372172)
QA Contact: pmoore
(In reply to Rob Thijssen (:grenade UTC+2) from comment #0) > the effect of the current mechanism is that output from generic worker is > logged in papertrail with an incorrect event time The log lines from generic-worker already contain the timestamp of the log entry evaluated at source, which you can use to cross reference with the event log: https://github.com/taskcluster/generic-worker/blob/a458804f5544ee3ecaf74d6647a26e3a4d984199/main.go#L903-L913 Despite this, the generic-worker logs will be of litle value, since they only tell you when they launch a process to run a task command. The logs from those commands themselves (task logs) will be much more detailed and informative. I would recommend forwarding the task logs (Z:\task_XXXXXXXXX\generic-worker\live_backing.log) via nxlog to papertrail too, in order to see what is happening when an instance becomes impaired.
No longer blocks: 1372172
Component: Generic-Worker → Workers

Miles, I think this is covered by the work you are doing in bug 1459203, right?

Depends on: 1459203
Flags: needinfo?(miles)

Yes, that's correct. We'll be writing to an eventlog with the same name as is specified during the installation of the Generic Worker service (default: "Generic Worker".

The way I've implemented it thus far is to add an output to the log package, this doesn't include all stdout/stderr though. Is capturing output intended to be logged sufficient? Should we switch use of fmt prints to use log?

Flags: needinfo?(miles)

(In reply to Miles Crabill [:miles] [also mcrabill@mozilla.com] from comment #3)

Yes, that's correct. We'll be writing to an eventlog with the same name as is specified during the installation of the Generic Worker service (default: "Generic Worker".

The way I've implemented it thus far is to add an output to the log package, this doesn't include all stdout/stderr though. Is capturing output intended to be logged sufficient?

Yes, that's fine.

Should we switch use of fmt prints to use log?

Where have you seen those? I think the only time we intentionally log to standard out from the worker is in response to generic-worker --help|-h|--version. To be safe, we should send both stderr and stdout to the event log, but if you find code writing to stderr instead of stdout, let's review it - it could be a mistake.

Flags: needinfo?(miles)

I don't think I see any real fmt prints besides those, so we're good! Anything logged using the log package will be sent to the event log. Anything that would ordinarily write to stdout/stderr when generic worker is run as a Windows service will go nowhere, as the service has no valid handles / files for either.

Flags: needinfo?(miles)
Blocks: 1565390
No longer blocks: 1565390
QA Whiteboard: [lang=go]

Miles: is there more to do here? Not sure whether comment #5 implies we're done or not.

Flags: needinfo?(miles)

I suspect this would get folded into the log work in worker-runner. Basically configuring an event-log backend for worker-runner would do the trick here.

Flags: needinfo?(miles)
You need to log in before you can comment on or make changes to this bug.