Closed Bug 523827 Opened 15 years ago Closed 14 years ago

Detect hung slaves

Categories

(Release Engineering :: General, defect, P2)

x86
Linux

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: rail)

References

Details

(Whiteboard: [buildslaves] )

Attachments

(4 files, 4 obsolete files)

While trying to deploy a change to production-master this morning, I discovered at least 4 slaves were in a "stuck" state:

moz2-win32-slave18 had been gracefully shutting down since the 19th
bm-xserve16 had been running a debug test since the 12th
bm-xserve18 had been running a debug test since the 14th
bm-xserve19 had been running a debug test since the 6th
One possible impl: a nagios check that watches /buildslaves periodically and looks for anything above a certain "last heard from" threshold. AFAICT these slaves were all in the "days ago" last heard from area.
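
For illustration only, a minimal sketch of what that polling check could look like (Python 2 to match the slave environment; the master URL and the regex-based HTML parsing are assumptions, not an implementation from this bug):

#!/usr/bin/env python
# Rough sketch of the /buildslaves polling idea above. The master URL and
# the HTML parsing are placeholders; a real check would want something
# sturdier than scraping the waterfall HTML.
import re
import urllib2

MASTER = "http://localhost:8010"  # hypothetical master URL
DAYS_RE = re.compile(r"\b\d+\s+days?\b", re.IGNORECASE)

html = urllib2.urlopen(MASTER + "/buildslaves").read()
for row in html.split("<tr>"):
    # Flag any table row that mentions an age measured in days (the page
    # shows a "last heard from" value per slave).
    if DAYS_RE.search(row):
        text = re.sub(r"<[^>]+>", " ", row)
        print "possible hung slave:", " ".join(text.split())[:120]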
(In reply to comment #1)
> One possible impl: a nagios check that watches /buildslaves periodically and
> looks for anything above a certain "last heard from" threshold. AFAICT these
> slaves were all in the "days ago" last heard from area.

Nevermind. This doesn't seem true for all of them.
I have a scraper for this purpose; it runs against the mobile master's buildslaves page and detects when each slave was last heard from. I have the threshold for mobile set at 60 minutes. Because of the PM1-PM2 split, the most useful section of this list is going to be the top "ATTENTION NEEDED LIST" section. It writes its output to http://production-mobile-master.mv.mozilla.com/n810-status (Mozilla MV VPN required); some sample output is at http://pastebin.mozilla.org/678275
Yeah, unfortunately, not all of the hung slaves have a high "last heard from". This could help for some, though.
Is there an indicator on the buildslaves page that I could capture?
(In reply to comment #5)
> Is there an indicator on the buildslaves page that I could capture?

Not on /buildslaves. The way catlee discovered some of these was to look at /buildslaves/$slave and notice a build that had been running for days.
Hmm, that would be possible with this script but would definitely put a lot of load on the buildbot masters to generate the /buildslaves page and each /buildslaves/$slave page for connected machines.

Out of curiosity, what was the last heard from time on these?
(In reply to comment #7)
> Hmm, that would be possible with this script but would definitely put a lot
> of load on the buildbot masters to generate the /buildslaves page and each
> /buildslaves/$slave page for connected machines.

Yeah, I'm concerned about this too. I think we're ok only polling every 4 or 6 hours or so, though.

> Out of curiosity, what was the last heard from time on these?

0 seconds
(In reply to comment #3)
> I have a scraper for this purpose, it runs for the mobile master's buildslave
> page and detects when the slave was last heard from.

Any chance you can post that script somewhere? I'd love to use it as well.
(In reply to comment #7)
> Hmm, that would be possible with this script but would definitely put a lot
> of load on the buildbot masters to generate the /buildslaves page and each
> /buildslaves/$slave page for connected machines.

The way we have pm/pm02 configured, with every slave set up on both masters, makes this a no-go - it takes tens of seconds for the master to try to find previous builds for machines that were never connected (or got moved more than a few days ago). You can put ?builds=0 on the slave URL (which means current build plus the last one) to make that less painful, but it still blocks any other HTTP requests to the master. Perhaps there is something else buried in buildbot/status/web we can use instead.
Could we do something with nagios inspecting the twistd.log file on the slave and complaining if it's too old?
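
As a sketch of what that kind of file-age check could look like (illustrative only; the checks that eventually landed use the stock check_file_age and CheckFile commands instead, and the path and thresholds here are placeholders):

#!/usr/bin/env python
# Minimal Nagios-plugin-style check: complain when twistd.log has not been
# written to for too long. Exits with standard Nagios return codes.
import os
import sys
import time

LOG = "/builds/slave/twistd.log"   # slave log path (adjust per platform)
WARN = 2 * 3600                    # warn if untouched for 2 hours
CRIT = 6 * 3600                    # critical if untouched for 6 hours

def main():
    try:
        age = time.time() - os.stat(LOG).st_mtime
    except OSError, e:
        print "CRITICAL: cannot stat %s: %s" % (LOG, e)
        return 2
    if age > CRIT:
        print "CRITICAL: %s not written in %d seconds" % (LOG, age)
        return 2
    if age > WARN:
        print "WARNING: %s not written in %d seconds" % (LOG, age)
        return 1
    print "OK: %s written %d seconds ago" % (LOG, age)
    return 0

if __name__ == "__main__":
    sys.exit(main())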
Philippe, it is on my personal HG repo http://hg.johnford.info/buildslaves-scrape/

Nick, You'll notice that we have 4 environments in a similar configuration to pm/pm02.  The action needed list is across all of the 4 buildbot masters we have for mobile

Chris, that sounds like a good idea and a lot less master-intrusive than scraping buildslaves. My worry is that, since the last heard from time was 0 seconds, the slave is still writing to its logs. Is that the case? Maybe we could look for something like "has the last buildstep on this slave started more than 8 hours ago?"
Gah, I haven't been able to push the latest copy of that script. When I remember where the latest copy is, I will push it to my HG server.
(In reply to comment #12)
> Nick, You'll notice that we have 4 environments in a similar configuration to
> pm/pm02.  The action needed list is across all of the 4 buildbot masters we
> have for mobile

Yeah, looks great for the mobile masters. I just think we'll kill pm if we poll /buildslaves and /buildslaves/$slave a lot.

(In reply to comment #11)
> Could we do something with nagios inspecting the twistd.log file on the slave
> and complaining if it's too old?

I like this idea a bunch. Looking at some slaves that were last heard from 10 to 12 hours ago (linux on pm02 is good for this), there's no heartbeat of keepalive messages in the log, though. Just the re-attaching to the builders after the reboot, and a
 2009/10/22 11:15 PDT [Broker,client] I have a leftover directory 'linux_repack' that is not being used by the buildmaster: you can delete it now
which I don't think we can rely on. So: a nagios check for a twistd.log that hasn't changed in, say, a day?
Component: Release Engineering → Release Engineering: Future
I think this needs to be done soon.
Component: Release Engineering: Future → Release Engineering
Moving to future as it isn't causing redness
Component: Release Engineering → Release Engineering: Future
Yes, this needs to be done soon. When this happens, it causes slaves to silently hang, and never accept new jobs. Over time, this means that we have fewer slaves actually processing jobs, which is bad for wait times. 

However, until we have someone who has time to work on this, it seems Future is an accurate place for it.
Assignee: nobody → catlee
Component: Release Engineering: Future → Release Engineering
bm-xserve19 has been stuck doing an m-c debug everything-else test run (crashtest) since the 23rd.
here's the last traceback on the slave:

2009-10-23 10:25:16-0700 [-] Unexpected error in main loop.
2009-10-23 10:25:16-0700 [-] Traceback (most recent call last):
2009-10-23 10:25:16-0700 [-]   File "/tools/buildbot/bin/buildbot", line 4, in <module>
2009-10-23 10:25:16-0700 [-]     runner.run()
2009-10-23 10:25:16-0700 [-]   File "/tools/buildbot-5f43578cba2b//lib/python2.5/site-packages/buildbot/scripts/runner.py", line 1016, in run
2009-10-23 10:25:16-0700 [-]     start(so)
2009-10-23 10:25:16-0700 [-]   File "/tools/buildbot-5f43578cba2b//lib/python2.5/site-packages/buildbot/scripts/startup.py", line 90, in start
2009-10-23 10:25:16-0700 [-]     launch(config)
2009-10-23 10:25:16-0700 [-]   File "/tools/buildbot-5f43578cba2b//lib/python2.5/site-packages/buildbot/scripts/startup.py", line 127, in launch
2009-10-23 10:25:16-0700 [-]     run()
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/scripts/twistd.py", line 24, in run
2009-10-23 10:25:16-0700 [-]     app.run(runApp, ServerOptions)
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/application/app.py", line 599, in run
2009-10-23 10:25:16-0700 [-]     runApp(config)
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/scripts/twistd.py", line 20, in runApp
2009-10-23 10:25:16-0700 [-]     _SomeApplicationRunner(config).run()
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/application/app.py", line 333, in run
2009-10-23 10:25:16-0700 [-]     self.postApplication()
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/scripts/_twistd_unix.py", line 261, in postApplication
2009-10-23 10:25:16-0700 [-]     self.profiler)
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/application/app.py", line 275, in runReactorWithLogging
2009-10-23 10:25:16-0700 [-]     traceback.print_exc(file=file)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/traceback.py", line 227, in print_exc
2009-10-23 10:25:16-0700 [-]     print_exception(etype, value, tb, limit, file)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/traceback.py", line 125, in print_exception
2009-10-23 10:25:16-0700 [-]     print_tb(tb, limit, file)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/traceback.py", line 69, in print_tb
2009-10-23 10:25:16-0700 [-]     line = linecache.getline(filename, lineno, f.f_globals)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/linecache.py", line 14, in getline
2009-10-23 10:25:16-0700 [-]     lines = getlines(filename, module_globals)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/linecache.py", line 40, in getlines
2009-10-23 10:25:16-0700 [-]     return updatecache(filename, module_globals)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/linecache.py", line 129, in updatecache
2009-10-23 10:25:16-0700 [-]     lines = fp.readlines()
2009-10-23 10:25:16-0700 [-] MemoryError
It would also be nice to get notified of exceptions that happen on slaves.
Summary: Detect hung slaves → Detect hung slaves, exceptions on slaves
We can detect a stale twistd.log on Windows by adding this to the [NRPE] section of nsc.ini:

allow_nasty_meta_chars=1

And then running this check from the server:
./check_nrpe -H $HOST -c CheckFile -a file="E:\\builds\\slave\\twistd.log" filter-written=\>12h MaxCrit=1 syntax="%filename% last changed %write%"

We'll have to make sure we have consistent slave directory names though.
Summary: Detect hung slaves, exceptions on slaves → Detect hung slaves
While cleaning up from bug 528670, I found:
* moz2-darwin9-slave22 hung since Nov 12th
* moz2-linux-slave39,40,48 hung since Nov 12th
* moz2-linux-slave49 never completed connection on Nov 10th
All connected to pm02.
When this lands, we can then add checks such as:
./check_nrpe -H $HOST -c check_file_age -a 7200 21600 /builds/slave/twistd.log

to the nagios server.
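
For that check_nrpe call to work, the slave-side nrpe.cfg needs a matching command definition. A sketch of the kind of entry involved (the plugin path and the argument-passing setup are assumptions here, not the contents of the attached patch):

# nrpe.cfg on the slave (sketch only; passing -a arguments over NRPE also
# requires the daemon to be built with command-argument support)
dont_blame_nrpe=1
command[check_file_age]=/usr/lib/nagios/plugins/check_file_age -w $ARG1$ -c $ARG2$ -f $ARG3$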
Attachment #426540 - Flags: review?(bhearsum)
Attached patch add nagios configs on mac (obsolete) — Splinter Review
Attachment #426566 - Flags: review?(bhearsum)
Attached file nrpe.cfg for mac
Attachment #426568 - Flags: review?(bhearsum)
This adds the 'allow_nasty_meta_chars' option to the existing nsc.ini file.
Attachment #426569 - Flags: review?(bhearsum)
Attachment #426540 - Flags: review?(bhearsum) → review+
Attachment #426566 - Flags: review?(bhearsum) → review+
Attachment #426568 - Flags: review?(bhearsum) → review+
Comment on attachment 426569 [details] [diff] [review]
OPSI package for managing nsc.ini on windows

What here allows us to check file age?
Comment on attachment 426540 [details] [diff] [review]
add check_file_age check

cvs commit centos5/etc/nagios/nrpe.cfg darwin9/nrpe.cfg
Checking in centos5/etc/nagios/nrpe.cfg;
/mofo/puppet-files/centos5/etc/nagios/nrpe.cfg,v  <--  nrpe.cfg
new revision: 1.2; previous revision: 1.1
done
RCS file: /mofo/puppet-files/darwin9/nrpe.cfg,v
done
Checking in darwin9/nrpe.cfg;
/mofo/puppet-files/darwin9/nrpe.cfg,v  <--  nrpe.cfg
initial revision: 1.1
done
Attachment #426540 - Flags: checked-in+
Comment on attachment 426568 [details]
nrpe.cfg for mac

cvs commit centos5/etc/nagios/nrpe.cfg darwin9/nrpe.cfg
Checking in centos5/etc/nagios/nrpe.cfg;
/mofo/puppet-files/centos5/etc/nagios/nrpe.cfg,v  <--  nrpe.cfg
new revision: 1.2; previous revision: 1.1
done
RCS file: /mofo/puppet-files/darwin9/nrpe.cfg,v
done
Checking in darwin9/nrpe.cfg;
/mofo/puppet-files/darwin9/nrpe.cfg,v  <--  nrpe.cfg
initial revision: 1.1
done
Attachment #426568 - Flags: checked-in+
(In reply to comment #27)
> (From update of attachment 426569 [details] [diff] [review])
> What here allows us to check file age?

We can add a command in the nagios server to do this:
check_nrpe -H $HOST -c CheckFile -a file="E:\\builds\\slave\\twistd.log"
filter-written=\>12h MaxCrit=1 syntax="%filename% last changed %write%"

CheckFile is already enabled via the CheckDisk module, I believe.
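
A sketch of what that could look like on the nagios server (object names and the hostgroup are illustrative assumptions; the check_nrpe arguments are the ones shown above):

# nagios server config (sketch only)
define command{
    command_name    check_win_twistd_log
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c CheckFile -a file="E:\\builds\\slave\\twistd.log" filter-written=\>12h MaxCrit=1 syntax="%filename% last changed %write%"
    }

define service{
    use                     generic-service
    hostgroup_name          windows-build-slaves
    service_description     twistd.log age
    check_command           check_win_twistd_log
    }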
Assignee: catlee → nobody
catlee - you seem to be actively working on this, should it be assigned to you?
Assignee: nobody → raliiev
bhearsum: ping on the r?
Attachment #426569 - Flags: review?(bhearsum) → review+
Attaching a refreshed copy.

Looks like the corresponding file (nrpe.cfg) and symlink have already been deployed (by hand?), so it would be better to deploy it via puppet.
Attachment #426566 - Attachment is obsolete: true
Attachment #435610 - Flags: review?(catlee)
Attachment #435610 - Flags: checked-in?
Comment on attachment 435610 [details] [diff] [review]
add nagios configs on mac (refreshed)

${fileroot}darwin9/nrpe.cfg should probably be changed to ${fileroot}darwin-shared/nrpe.cfg.

Other than that, looks ok to me, but I know a lot has happened with osx ref platforms.
Attachment #435610 - Flags: review?(catlee)
Attachment #435610 - Flags: review?(bhearsum)
Attachment #435610 - Flags: review+
Comment on attachment 435610 [details] [diff] [review]
add nagios configs on mac (refreshed)

Yeah, please move this to darwin-shared. r=me otherwise
(In reply to comment #35)
> (From update of attachment 435610 [details] [diff] [review])
> Yeah, please move this to darwin-shared. r=me otherwise

Could anybody move the corresponding file from darwin9 to darwin-shared? The only reason I didn't change the path to the darwin-shared one is that the file is still in darwin9.
Could you review the following patch for the nscp OPSI package?

I've just added winbatch_nscp_restart to restart the nscp service and reread the configuration file.
Attachment #436135 - Flags: review?(bhearsum)
Attachment #435610 - Attachment is obsolete: true
Attachment #436159 - Flags: review?(bhearsum)
Attachment #435610 - Flags: review?(bhearsum)
Attachment #435610 - Flags: checked-in?
(In reply to comment #36)
> (In reply to comment #35)
> > (From update of attachment 435610 [details] [diff] [review] [details])
> > Yeah, please move this to darwin-shared. r=me otherwise
> 
> Could anybody move the corresponding file from darwin9 to darwin-shared? The
> only reason I didn't change the path to the darwin-shared one is that the
> file is still in darwin9.

Done.

[root@staging-puppet puppet-files]# cvs ci -m "Bug 523827. Move darwin9/nrpe.cfg to darwin-shared/nrpe.cfg."
RCS file: /mofo/puppet-files/darwin-shared/nrpe.cfg,v
done
Checking in darwin-shared/nrpe.cfg;
/mofo/puppet-files/darwin-shared/nrpe.cfg,v  <--  nrpe.cfg
initial revision: 1.1
done
Removing darwin9/nrpe.cfg;
/mofo/puppet-files/darwin9/nrpe.cfg,v  <--  nrpe.cfg
new revision: delete; previous revision: 1.1
done
Attachment #436159 - Flags: review?(bhearsum) → review+
Comment on attachment 436135 [details] [diff] [review]
OPSI package for managing nsc.ini on windows + nscp restart

This is incomplete; You're missing the entire OPSI/ directory.

Does %ProgramFilesDir% actually work? It doesn't from a cmd prompt. You can also just use 'net stop/start nsclientpp' to control the service.

Have you been able to test this in staging? If not, someone should show you how to.
Attachment #436135 - Flags: review?(bhearsum) → review-
(In reply to comment #40)
> (From update of attachment 436135 [details] [diff] [review])
> This is incomplete; You're missing the entire OPSI/ directory.

Oops, I forgot to hg add it.

> Does %ProgramFilesDir% actually work? It doesn't from a cmd prompt. 

Do you mean DosBatch/DosInAction?
I guess it should work for winbatch.

> You can
> also just use 'net stop/start nsclientpp' to control the service.

Good idea.

> Have you been able to test this in staging? If not, someone should show you how
> to.

Not tested at all. Attaching the refreshed patch to be tested in staging.
Attachment #436135 - Attachment is obsolete: true
Priority: -- → P2
Attachment #436159 - Flags: checked-in?
Attachment #436654 - Attachment is obsolete: true
Comment on attachment 426569 [details] [diff] [review]
OPSI package for managing nsc.ini on windows

I tested this package in staging. It worked as expected.

Scenario:
1) create an OPSI package, install to opsi-staging.
2) deploy on win32-slave03
3) Compare old and new file (zero diff)
4) shut down buildbot and wait for Nagios CRITICAL alert
5) start buildbot and wait for Nagios OK alert
Attachment #426569 - Flags: checked-in?
Comment on attachment 426569 [details] [diff] [review]
OPSI package for managing nsc.ini on windows

changeset:   47:554e579160cc
Attachment #426569 - Flags: checked-in? → checked-in+
I set the OPSI package to roll out on all of the Windows build machines.
Comment on attachment 436159 [details] [diff] [review]
add nagios configs on mac (refreshed, use darwin-shared)

changeset:   129:149711c89aee
Attachment #436159 - Flags: checked-in? → checked-in+
The Mac machines should pick up the puppet file changes on next reboot.
Most of the Windows machines have rebooted and got these changes. I also updated the ref machines.
Depends on: 564404
Whiteboard: [buildslaves]
Now we have these checks enabled and working. Closing.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering