Closed
Bug 523827
Opened 15 years ago
Closed 14 years ago
Detect hung slaves
Categories
(Release Engineering :: General, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: catlee, Assigned: rail)
References
Details
(Whiteboard: [buildslaves] )
Attachments
(4 files, 4 obsolete files)
1.11 KB, patch (bhearsum: review+; catlee: checked-in+)
6.55 KB, text/plain (bhearsum: review+; catlee: checked-in+)
6.06 KB, patch (bhearsum: review+; bhearsum: checked-in+)
1.02 KB, patch (bhearsum: review+; bhearsum: checked-in+)
While trying to deploy a change to production-master this morning, I discovered at least 4 slaves were in a "stuck" state:
moz2-win32-slave18 was gracefully shutting down since the 19th
bm-xserve16 was doing a debug test from the 12th
bm-xserve18 was doing a debug test from the 14th
bm-xserve19 was doing a debug test from the 6th
Comment 1•15 years ago
One possible impl: a nagios check that watches /buildslaves periodically and looks for anything above a certain "last heard from" threshold. AFAICT these slaves were all in the "days ago" last heard from area.
Comment 2•15 years ago
(In reply to comment #1)
> One possible impl: a nagios check that watches /buildslaves periodically and
> looks for anything above a certain "last heard from" threshold. AFAICT these
> slaves were all in the "days ago" last heard from area.

Nevermind. This doesn't seem true for all of them.
Comment 3•15 years ago
I have a scraper for this purpose, it runs for the mobile master's buildslave page and detects when the slave was last heard from. I have the threshold for mobile at 60 minutes. Because of the PM1-PM2 split, the most useful section of this list is going to be the top "ATTENTION NEEDED LIST" section. It puts output to http://production-mobile-master.mv.mozilla.com/n810-status (Mozilla MV VPN required) and some sample output at http://pastebin.mozilla.org/678275
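A scraper along these lines can be sketched in a few lines of Python. This is a hypothetical reconstruction, not the actual script from hg.johnford.info; in particular, the table-row format the regex matches is an assumption about the /buildslaves page HTML.

```python
# Hypothetical sketch of a "last heard from" scraper for a buildbot
# /buildslaves page. The HTML row format matched below is an assumption,
# not the real page layout.
import re

# Attention threshold; comment 3 uses 60 minutes for mobile.
THRESHOLD_SECONDS = 60 * 60

def parse_last_heard(html):
    """Extract (slave name, seconds since last heard) pairs from the
    page source (assumed format: <td>name</td> <td>H hrs, M mins</td>)."""
    pairs = []
    for name, hours, minutes in re.findall(
            r'<td>([\w-]+)</td>\s*<td>(\d+) hrs, (\d+) mins</td>', html):
        pairs.append((name, int(hours) * 3600 + int(minutes) * 60))
    return pairs

def attention_needed(pairs, threshold=THRESHOLD_SECONDS):
    """Return slaves not heard from within the threshold, stalest first."""
    stale = [(age, name) for name, age in pairs if age > threshold]
    return [name for age, name in sorted(stale, reverse=True)]
```

Run from cron, the output of attention_needed() would feed an "ATTENTION NEEDED LIST" section like the one described above.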
Comment 4•15 years ago
Yeah, unfortunately, not all of the hung slaves have a high "last heard from". This could help for some, though.
Comment 5•15 years ago
is there an indicator on the buildslaves page that i could capture?
Comment 6•15 years ago
(In reply to comment #5)
> is there an indicator on the buildslaves page that i could capture?

Not on /buildslaves. The way catlee discovered some of these was to look at /buildslaves/$slave and notice that a build was running from days ago.
Comment 7•15 years ago
hmm, that would be possible with this script but would definitely put a lot of load on the buildbot masters to generate buildslaves and each buildslave/$slave page for connected machines. Out of curiosity, what was the last heard from time on these?
Comment 8•15 years ago
(In reply to comment #7)
> hmm, that would be possible with this script but would definitely put a lot of
> load on the buildbot masters to generate buildslaves and each buildslave/$slave
> page for connected machines.

Yeah, I'm concerned about this too. I think we're ok only polling every 4 or 6 hours or so, though.

> out of curiosity, what was the last heard from time on these?

0 seconds
Comment 9•15 years ago
(In reply to comment #3)
> I have a scraper for this purpose, it runs for the mobile master's buildslave
> page and detects when the slave was last heard from.

Any chance you can post that script somewhere? I'd love to use it as well.
Comment 10•15 years ago
(In reply to comment #7)
> hmm, that would be possible with this script but would definitely put a lot of
> load on the buildbot masters to generate buildslaves and each buildslave/$slave
> page for connected machines.

The way we have pm/pm02 configured, with every slave set up on both masters, makes this a no-go - it takes tens of seconds for the master to try to find previous builds for machines that were never connected (or got moved more than a few days ago). You can put ?builds=0 on the slave url (which means current build plus the last one) to make that less painful, but it still blocks any other http requests to the master. Perhaps there is something else buried in buildbot/status/web we can use instead.
Reporter
Comment 11•15 years ago
Could we do something with nagios inspecting the twistd.log file on the slave and complaining if it's too old?
Comment 12•15 years ago
Philippe, it is on my personal HG repo: http://hg.johnford.info/buildslaves-scrape/

Nick, you'll notice that we have 4 environments in a similar configuration to pm/pm02. The action needed list is across all 4 of the buildbot masters we have for mobile.

Chris, that sounds like a good idea and a lot less master-intrusive than scraping buildslaves. I would be worried that, as the last heard from time was 0 seconds, it is still writing to the logs. Is this the case? Maybe we could look for something like "has the last buildstep on this slave started more than 8 hours ago?"
Comment 13•15 years ago
gah, I haven't been able to push the latest copy of that script. When I remember where the latest copy is I will push it to my HG server.
Comment 14•15 years ago
(In reply to comment #12)
> Nick, you'll notice that we have 4 environments in a similar configuration to
> pm/pm02. The action needed list is across all of the 4 buildbot masters we
> have for mobile

Yeah, looks great for the mobile masters. I just think we'll kill pm if we poll buildslaves and buildslaves/$slave a lot.

(In reply to comment #11)
> Could we do something with nagios inspecting the twistd.log file on the slave
> and complaining if it's too old?

I like this idea a bunch. Looking at some slaves that were last heard from 10 to 12 hours ago (linux on pm02 are good for this), there's no heartbeat of keepalive messages there though. Just re-attaching to the builders after the reboot, and a

2009/10/22 11:15 PDT [Broker,client] I have a leftover directory 'linux_repack' that is not being used by the buildmaster: you can delete it now

which I don't think we can rely on. So a nagios check for a twistd.log that hasn't changed in a... day?
Updated•15 years ago
Component: Release Engineering → Release Engineering: Future
Reporter
Comment 15•15 years ago
I think this needs to be done soon.
Component: Release Engineering: Future → Release Engineering
Comment 16•15 years ago
Moving to future as it isn't causing redness
Component: Release Engineering → Release Engineering: Future
Comment 17•15 years ago
Yes, this needs to be done soon. When this happens, it causes slaves to silently hang and never accept new jobs. Over time, this means that we have fewer slaves actually processing jobs, which is bad for wait times. However, until we have someone who has time to work on this, Future seems an accurate place for it.
Reporter
Updated•15 years ago
Assignee: nobody → catlee
Component: Release Engineering: Future → Release Engineering
Reporter
Comment 18•15 years ago
bm-xserve19 stuck doing a m-c debug everything else test run (crashtest) since the 23rd.
Reporter
Comment 19•15 years ago
here's the last traceback on the slave:

2009-10-23 10:25:16-0700 [-] Unexpected error in main loop.
2009-10-23 10:25:16-0700 [-] Traceback (most recent call last):
2009-10-23 10:25:16-0700 [-]   File "/tools/buildbot/bin/buildbot", line 4, in <module>
2009-10-23 10:25:16-0700 [-]     runner.run()
2009-10-23 10:25:16-0700 [-]   File "/tools/buildbot-5f43578cba2b//lib/python2.5/site-packages/buildbot/scripts/runner.py", line 1016, in run
2009-10-23 10:25:16-0700 [-]     start(so)
2009-10-23 10:25:16-0700 [-]   File "/tools/buildbot-5f43578cba2b//lib/python2.5/site-packages/buildbot/scripts/startup.py", line 90, in start
2009-10-23 10:25:16-0700 [-]     launch(config)
2009-10-23 10:25:16-0700 [-]   File "/tools/buildbot-5f43578cba2b//lib/python2.5/site-packages/buildbot/scripts/startup.py", line 127, in launch
2009-10-23 10:25:16-0700 [-]     run()
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/scripts/twistd.py", line 24, in run
2009-10-23 10:25:16-0700 [-]     app.run(runApp, ServerOptions)
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/application/app.py", line 599, in run
2009-10-23 10:25:16-0700 [-]     runApp(config)
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/scripts/twistd.py", line 20, in runApp
2009-10-23 10:25:16-0700 [-]     _SomeApplicationRunner(config).run()
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/application/app.py", line 333, in run
2009-10-23 10:25:16-0700 [-]     self.postApplication()
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/scripts/_twistd_unix.py", line 261, in postApplication
2009-10-23 10:25:16-0700 [-]     self.profiler)
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/application/app.py", line 275, in runReactorWithLogging
2009-10-23 10:25:16-0700 [-]     traceback.print_exc(file=file)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/traceback.py", line 227, in print_exc
2009-10-23 10:25:16-0700 [-]     print_exception(etype, value, tb, limit, file)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/traceback.py", line 125, in print_exception
2009-10-23 10:25:16-0700 [-]     print_tb(tb, limit, file)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/traceback.py", line 69, in print_tb
2009-10-23 10:25:16-0700 [-]     line = linecache.getline(filename, lineno, f.f_globals)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/linecache.py", line 14, in getline
2009-10-23 10:25:16-0700 [-]     lines = getlines(filename, module_globals)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/linecache.py", line 40, in getlines
2009-10-23 10:25:16-0700 [-]     return updatecache(filename, module_globals)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/linecache.py", line 129, in updatecache
2009-10-23 10:25:16-0700 [-]     lines = fp.readlines()
2009-10-23 10:25:16-0700 [-] MemoryError
Reporter
Comment 20•15 years ago
It would also be nice to get notified of exceptions that happen on slaves.
Summary: Detect hung slaves → Detect hung slaves, exceptions on slaves
Reporter
Comment 21•15 years ago
We can detect old versions of twistd.log on windows by adding this to nsc.ini in the [NRPE] section:

allow_nasty_meta_chars=1

And then running this check from the server:

./check_nrpe -H $HOST -c CheckFile -a file="E:\\builds\\slave\\twistd.log" filter-written=\>12h MaxCrit=1 syntax="%filename% last changed %write%"

We'll have to make sure we have consistent slave directory names, though.
Reporter
Updated•15 years ago
Summary: Detect hung slaves, exceptions on slaves → Detect hung slaves
Comment 22•15 years ago
While cleaning up from bug 528670, found:

* moz2-darwin9-slave22 hung since Nov 12th
* moz2-linux-slave39,40,48 hung since Nov 12th
* moz2-linux-slave49 never completed connection on Nov 10th

All connected to pm02.
Reporter
Comment 23•14 years ago
when this lands, we can then add checks such as:

./check_nrpe -H $HOST -c check_file_age -a 7200 21600 /builds/slave/twistd.log

to the nagios server.
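On the Nagios server side, that NRPE check would be wired up with object definitions along these lines (the host name and service template below are illustrative placeholders, not the actual production config):

```cfg
# Hypothetical Nagios object definitions; names are placeholders.
define command{
    command_name    check_twistd_log_age
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_file_age -a 7200 21600 /builds/slave/twistd.log
    }

define service{
    use                     generic-service    ; assumed local template
    host_name               moz2-linux-slave39
    service_description     twistd.log age
    check_command           check_twistd_log_age
    }
```

Because -a forwards arguments over NRPE, the daemon on the slave must permit command arguments (dont_blame_nrpe=1); alternatively, the thresholds can be baked into the slave-side nrpe.cfg command definition.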
Attachment #426540 -
Flags: review?(bhearsum)
Reporter
Comment 24•14 years ago
Attachment #426566 -
Flags: review?(bhearsum)
Reporter
Comment 25•14 years ago
Attachment #426568 -
Flags: review?(bhearsum)
Reporter
Comment 26•14 years ago
This adds the 'allow_nasty_meta_chars' option to the existing nsc.ini file.
Attachment #426569 -
Flags: review?(bhearsum)
Updated•14 years ago
Attachment #426540 -
Flags: review?(bhearsum) → review+
Updated•14 years ago
Attachment #426566 -
Flags: review?(bhearsum) → review+
Updated•14 years ago
Attachment #426568 -
Flags: review?(bhearsum) → review+
Comment 27•14 years ago
Comment on attachment 426569 [details] [diff] [review]
OPSI package for managing nsc.ini on windows

What here allows us to check file age?
Reporter
Comment 28•14 years ago
Comment on attachment 426540 [details] [diff] [review]
add check_file_age check

cvs commit centos5/etc/nagios/nrpe.cfg darwin9/nrpe.cfg
Checking in centos5/etc/nagios/nrpe.cfg;
/mofo/puppet-files/centos5/etc/nagios/nrpe.cfg,v <-- nrpe.cfg
new revision: 1.2; previous revision: 1.1
done
RCS file: /mofo/puppet-files/darwin9/nrpe.cfg,v
done
Checking in darwin9/nrpe.cfg;
/mofo/puppet-files/darwin9/nrpe.cfg,v <-- nrpe.cfg
initial revision: 1.1
done
Attachment #426540 -
Flags: checked-in+
Reporter
Comment 29•14 years ago
Comment on attachment 426568 [details]
nrpe.cfg for mac
cvs commit centos5/etc/nagios/nrpe.cfg darwin9/nrpe.cfg
Checking in centos5/etc/nagios/nrpe.cfg;
/mofo/puppet-files/centos5/etc/nagios/nrpe.cfg,v <-- nrpe.cfg
new revision: 1.2; previous revision: 1.1
done
RCS file: /mofo/puppet-files/darwin9/nrpe.cfg,v
done
Checking in darwin9/nrpe.cfg;
/mofo/puppet-files/darwin9/nrpe.cfg,v <-- nrpe.cfg
initial revision: 1.1
done
Attachment #426568 -
Flags: checked-in+
Reporter
Comment 30•14 years ago
(In reply to comment #27)
> (From update of attachment 426569 [details] [diff] [review])
> What here allows us to check file age?

We can add a command on the nagios server to do this:

check_nrpe -H $HOST -c CheckFile -a file="E:\\builds\\slave\\twistd.log" filter-written=\>12h MaxCrit=1 syntax="%filename% last changed %write%"

CheckFile is already enabled via the CheckDisk module, I believe.
Reporter
Updated•14 years ago
Assignee: catlee → nobody
Comment 31•14 years ago
catlee - you seem to be actively working on this, should it be assigned to you?
Assignee
Updated•14 years ago
Assignee: nobody → raliiev
Reporter
Comment 32•14 years ago
bhearsum: ping on the r?
Updated•14 years ago
Attachment #426569 -
Flags: review?(bhearsum) → review+
Assignee
Comment 33•14 years ago
Attaching a refreshed copy. Looks like the corresponding file (nrpe.cfg) and symlink have already been deployed (by hand?), so it would be better if we deploy it via puppet.
Attachment #426566 -
Attachment is obsolete: true
Attachment #435610 -
Flags: review?(catlee)
Attachment #435610 -
Flags: checked-in?
Reporter
Comment 34•14 years ago
Comment on attachment 435610 [details] [diff] [review]
add nagios configs on mac (refreshed)

${fileroot}darwin9/nrpe.cfg should probably be changed to ${fileroot}darwin-shared/nrpe.cfg. Other than that, looks ok to me, but I know a lot has happened with osx ref platforms.
Attachment #435610 -
Flags: review?(catlee)
Attachment #435610 -
Flags: review?(bhearsum)
Attachment #435610 -
Flags: review+
Comment 35•14 years ago
Comment on attachment 435610 [details] [diff] [review]
add nagios configs on mac (refreshed)

Yeah, please move this to darwin-shared. r=me otherwise.
Assignee
Comment 36•14 years ago
(In reply to comment #35)
> (From update of attachment 435610 [details] [diff] [review])
> Yeah, please move this to darwin-shared. r=me otherwise

Could anybody move the corresponding file from darwin9 to darwin-shared? The only reason I didn't change the path to the darwin-shared one was the file which is in darwin9.
Assignee
Comment 37•14 years ago
Could you review the following patch for the nscp OPSI package? I've just added winbatch_nscp_restart to restart the nscp service and reread the configuration file.
Attachment #436135 -
Flags: review?(bhearsum)
Assignee
Comment 38•14 years ago
Attachment #435610 -
Attachment is obsolete: true
Attachment #436159 -
Flags: review?(bhearsum)
Attachment #435610 -
Flags: review?(bhearsum)
Attachment #435610 -
Flags: checked-in?
Comment 39•14 years ago
(In reply to comment #36)
> (In reply to comment #35)
> > (From update of attachment 435610 [details] [diff] [review] [details])
> > Yeah, please move this to darwin-shared. r=me otherwise
>
> Could anybody move the corresponding file from darwin9 to darwin-shared? The
> only reason why I didn't changed the path to darwin-shared one was the file
> which is in darwin9.

Done.

[root@staging-puppet puppet-files]# cvs ci -m "Bug 523827. Move darwin9/nrpe.cfg to darwin-shared/nrpe.cfg."
RCS file: /mofo/puppet-files/darwin-shared/nrpe.cfg,v
done
Checking in darwin-shared/nrpe.cfg;
/mofo/puppet-files/darwin-shared/nrpe.cfg,v <-- nrpe.cfg
initial revision: 1.1
done
Removing darwin9/nrpe.cfg;
/mofo/puppet-files/darwin9/nrpe.cfg,v <-- nrpe.cfg
new revision: delete; previous revision: 1.1
done
Updated•14 years ago
Attachment #436159 -
Flags: review?(bhearsum) → review+
Comment 40•14 years ago
Comment on attachment 436135 [details] [diff] [review]
OPSI package for managing nsc.ini on windows + nscp restart

This is incomplete; you're missing the entire OPSI/ directory.

Does %ProgramFilesDir% actually work? It doesn't from a cmd prompt. You can also just use 'net stop/start nsclientpp' to control the service.

Have you been able to test this in staging? If not, someone should show you how to.
Attachment #436135 -
Flags: review?(bhearsum) → review-
Assignee
Comment 41•14 years ago
(In reply to comment #40)
> (From update of attachment 436135 [details] [diff] [review])
> This is incomplete; You're missing the entire OPSI/ directory.

Ooops, forgot to hg add it.

> Does %ProgramFilesDir% actually work? It doesn't from a cmd prompt.

Do you mean DosBatch/DosInAction? I guess it should work for winbatch.

> You can also just use 'net stop/start nsclientpp' to control the service.

Good idea.

> Have you been able to test this in staging? If not, someone should show you how to.

Not tested at all. Attaching the refreshed patch to be tested in staging.
Attachment #436135 -
Attachment is obsolete: true
Assignee
Updated•14 years ago
Priority: -- → P2
Assignee
Updated•14 years ago
Attachment #436159 -
Flags: checked-in?
Assignee
Updated•14 years ago
Attachment #436654 -
Attachment is obsolete: true
Assignee
Comment 42•14 years ago
Comment on attachment 426569 [details] [diff] [review]
OPSI package for managing nsc.ini on windows

I tested this package in staging; it worked as expected. Scenario:
1) Create an OPSI package, install to opsi-staging.
2) Deploy on win32-slave03.
3) Compare old and new file (zero diff).
4) Shut down buildbot and wait for Nagios CRITICAL alert.
5) Start buildbot and wait for Nagios OK alert.
Attachment #426569 -
Flags: checked-in?
Comment 43•14 years ago
Comment on attachment 426569 [details] [diff] [review]
OPSI package for managing nsc.ini on windows

changeset: 47:554e579160cc
Attachment #426569 -
Flags: checked-in? → checked-in+
Comment 44•14 years ago
I set the OPSI package to roll out on all of the Windows build machines.
Comment 45•14 years ago
Comment on attachment 436159 [details] [diff] [review]
add nagios configs on mac (refreshed, use darwin-shared)

changeset: 129:149711c89aee
Attachment #436159 -
Flags: checked-in? → checked-in+
Comment 46•14 years ago
The Mac machines should pick up the puppet file changes on next reboot.
Comment 47•14 years ago
Most of the Windows machines have rebooted and got these changes. I also updated the ref machines.
Assignee
Updated•14 years ago
Whiteboard: [buildslaves]
Assignee
Comment 48•14 years ago
Now we have these checks enabled and working. Closing.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•11 years ago
|
Product: mozilla.org → Release Engineering