Closed Bug 1161658 Opened 9 years ago Closed 9 years ago

send buildbot-master cron mail to syslog

Categories

(Infrastructure & Operations :: RelOps: Puppet, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Assigned: arich)

Attachments

(1 file)

The buildbot masters are currently sending their output to email. We should probably move this to syslog and set up alerts where necessary.

Cron entries are defined in: modules/buildmaster/templates/buildmaster-cron.erb 

Coop and/or buildduty folks, anything you don't want sent to syslog vs email? And what sort of alerts should we set up?
Flags: needinfo?(sdeckelmann)
Flags: needinfo?(mgervasini)
Flags: needinfo?(coop)
One of the cron jobs already redirects its output somewhere (either to /dev/null or to a cleanup file, depending on the stage of the job). I'm not sure what we'd want done with that one. The entry below is taken from one specific machine, so the directory names will vary between jobs:

15 * * * * cltbld lockfile -60 -r 3 /var/lock/cltbld/lockfile.bm01-tests1-linux32_cleanup 2>/dev/null && (nice -n 19 /builds/buildbot/tests1-linux32/bin/python /builds/buildbot/tests1-linux32/tools/buildfarm/maintenance/master_cleanup.py -t4 /builds/buildbot/tests1-linux32/master ; rm -f /var/lock/cltbld/lockfile.bm01-tests1-linux32_cleanup) >> cleanup.log 2>&1

The log file I looked at on that machine was 5 days old and had no data in it.

I'm also not sure why we'd ever remove a lock file without also looking for a process to clean up:

@hourly cltbld find /var/lock/cltbld -name lockfile.bbdb -mmin +360 -delete
(In reply to Amy Rich [:arich] [:arr] from comment #0)
> The buildbot masters are currently sending their output to email. We should
> probably move this to syslog and set up alerts where necessary.

The reconfig code that landed today predates our use of papertrail, but should absolutely be smarter about where it logs. 

Happy to pipe the output to syslog instead. Is there a canonical way to do that? logger?
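
For cron entries that are plain shell, `logger` with a tag is one common route. For the scripts that are already Python, here is a minimal sketch of the same idea, assuming the standard-library `syslog` module is acceptable on the masters. The function name `forward_lines`, the `emit` hook, and the `LOG_CRON` facility are illustrative choices, not something taken from the patch on this bug.

```python
import syslog


def forward_lines(lines, tag, emit=None):
    """Send each non-empty line to syslog under `tag`; return the count sent.

    `emit` is a hypothetical hook so the function can be exercised without
    writing to the real syslog. By default it calls syslog.syslog().
    """
    if emit is None:
        # The tag (ident) is what a log search can filter on later.
        syslog.openlog(ident=tag, facility=syslog.LOG_CRON)
        emit = lambda msg: syslog.syslog(syslog.LOG_INFO, msg)
    sent = 0
    for line in lines:
        msg = line.rstrip("\n")
        if msg.strip():
            emit(msg)
            sent += 1
    return sent
```

A script could call something like `forward_lines(sys.stdin, "maybe_reconfig")` at the end of a pipeline; the tag name there is only an example.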

> Coop and/or buildduty folks, anything you don't want sent to syslog vs
> email? And what sort of alerts should we set up?

I haven't heard a peep out of the other master cronjobs, so I wouldn't worry about them.

For the reconfig code, we'll want to know if the script encounters an existing lockfile for starters.
Flags: needinfo?(coop)
My suggestions for the two obvious candidates to get them into syslog.
coop: I added a patch for the two obvious ones. What would we see in the logs if we encountered a lock file? We can search for the tag and the strings we want to alert on to narrow it down to the specific application so we aren't getting extraneous stuff.

The twistd log munger is the other thing that we get a lot of output from. That's easy to slap into syslog, too. What, if anything, do we want to search for there?
Flags: needinfo?(coop)
(In reply to Amy Rich [:arich] [:arr] from comment #4)
> coop: I added a patch for the two obvious ones. What would we see in the
> logs if we encountered a lock file? We can search for the tag and the
> strings we want to alert on to narrow it down to the specific application so
> we aren't getting extraneous stuff.

We handle the lockfile within the reconfig script. The current threshold is set to 120 minutes, after which we start writing the following message to stderr every hour: 

"Reconfig lockfile is older than ${LOCKFILE_MAX_AGE} minutes."

https://hg.mozilla.org/build/tools/file/fd55dd6ea190/buildfarm/maintenance/maybe_reconfig.sh#l81
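
As a rough illustration of that check (not the code in maybe_reconfig.sh, which is a shell script), here is a sketch in Python. It assumes staleness is judged by the lockfile's modification time; the function names and the `now` parameter are hypothetical and exist only to make the logic easy to test.

```python
import os
import time


def lockfile_is_stale(path, max_age_minutes=120, now=None):
    """Return True if `path` exists and its mtime is older than the threshold."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return False  # no lockfile means there is nothing to report
    if now is None:
        now = time.time()
    return (now - mtime) > max_age_minutes * 60


def stale_message(max_age_minutes=120):
    # Same wording as the stderr message quoted above, with the value filled in.
    return "Reconfig lockfile is older than %d minutes." % max_age_minutes
```

An alert search would then only need to match the fixed part of that message.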
 
> The twistd log munger is the other thing that we get a lot of output from.
> That's easy to slap into syslog, too. What, if anything, do we want to
> search for there?

The watch_twistd_log script is a hot mess. 

We don't care about 99% of the email it generates, because to fix any of the exceptions would involve investing time in buildbot internals or upgrades. The 1% we do care about are UnauthorizedLogin exceptions. These usually indicate a poorly- or partially-configured slave is trying to connect to the master. The slave tries to connect multiple times/sec, leading to a multi-MB mail that is usually too big to send (unless we do aggregation beforehand).

I would be happy to never see these emails again *except* for the UnauthorizedLogin exceptions. Don't have an example in front of me, but the code that generates the output is here:

https://hg.mozilla.org/build/tools/file/fd55dd6ea190/buildfarm/maintenance/watch_twistd_log.py#l62
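
The aggregation mentioned above could look roughly like the sketch below. It is not the logic in watch_twistd_log.py. It assumes an UnauthorizedLogin exception can be recognised by that substring appearing on a log line (the real twistd log layout is not shown in this bug), and the function name `summarize_unauthorized` is made up for the example.

```python
def summarize_unauthorized(lines, needle="UnauthorizedLogin"):
    """Collapse every line containing `needle` into one short summary string.

    Returns None when nothing matched, so the caller can stay silent.
    """
    count = 0
    first = None
    for line in lines:
        if needle in line:
            count += 1
            if first is None:
                first = line.strip()  # keep one sample line for context
    if count == 0:
        return None
    return "%s seen %d time(s); first occurrence: %s" % (needle, count, first)
```

Emitting one summary line per run instead of one line per connection attempt would keep the output small even when a slave retries several times a second.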
Flags: needinfo?(coop)
Attachment #8601706 - Flags: review?(coop)
I've set up https://papertrailapp.com/searches/4642174/edit and https://papertrailapp.com/searches/4642184/edit in anticipation of sending the twistd log exceptions and the maybe_reconfig.sh output to papertrail.
Attachment #8601706 - Flags: review?(coop) → review+
Depends on: 1162154
Blocks: 1150557
No longer blocks: 150557
Flags: needinfo?(mgervasini)
Amy: Anything left to do here?
coop: the dep bug to fix the actual script is still open.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Flags: needinfo?(sdeckelmann)