Status

P2
major
RESOLVED FIXED
9 years ago
5 years ago

People

(Reporter: catlee, Assigned: rail)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [buildslaves] )

Attachments

(4 attachments, 4 obsolete attachments)

(Reporter)

Description

9 years ago
While trying to deploy a change to production-master this morning, I discovered at least 4 slaves were in a "stuck" state:

moz2-win32-slave18 was gracefully shutting down since the 19th
bm-xserve16 was doing a debug test from the 12th
bm-xserve18 was doing a debug test from the 14th
bm-xserve19 was doing a debug test from the 6th
One possible impl: a nagios check that watches /buildslaves periodically and looks for anything above a certain "last heard from" threshold. AFAICT these slaves were all in the "days ago" last heard from area.
(In reply to comment #1)
> One possible impl: a nagios check that watches /buildslaves periodically and
> looks for anything above a certain "last heard from" threshold. AFAICT these
> slaves were all in the "days ago" last heard from area.

Nevermind. This doesn't seem true for all of them.
I have a scraper for this purpose, it runs for the mobile master's buildslave page and detects when the slave was last heard from.  I have the threshold for mobile at 60 minutes.  Because of the PM1-PM2 split, the most useful section of this list is going to be the top "ATTENTION NEEDED LIST" section.  It puts output to http://production-mobile-master.mv.mozilla.com/n810-status (Mozilla MV VPN required) and some sample output at http://pastebin.mozilla.org/678275
Yeah, unfortunately, not all of the hung slaves have a high "last heard from". This could help for some, though.
is there an indicator on the buildslaves page that i could capture?
(In reply to comment #5)
> is there an indicator on the buildslaves page that i could capture?

NOt on /buildslaves. The way catlee discovered some of these was to look at /buildslaves/$slave and noticed that a build was running from days ago.
hmm, that would be possible with this script but would definately put a lot of load on the buildbot masters to generate buildslaves and each buildslave/$slave page for connected machines.

out of curiosity, what was the last heard from time on these?
(In reply to comment #7)
> hmm, that would be possible with this script but would definately put a lot of
> load on the buildbot masters to generate buildslaves and each buildslave/$slave
> page for connected machines.

Yeah, I'm concerned about this too. I think we're ok only polling every 4 or 6 hours or so, though.

> out of curiosity, what was the last heard from time on these?

0 seconds
(In reply to comment #3)
> I have a scraper for this purpose, it runs for the mobile master's buildslave
> page and detects when the slave was last heard from.

Any chance you can post that script somewhere? I'd love to use it as well.
(In reply to comment #7)
> hmm, that would be possible with this script but would definately put a lot of
> load on the buildbot masters to generate buildslaves and each buildslave/$slave
> page for connected machines.

The way we have pm/pm02 configured with every slave set up on both masters makes this a no go - it takes tens of seconds for the master to try to find previous builds for machines that were never connected (or got moved more than a few days ago). You can put ?builds=0 on the slave url (which means current build plus the last one) to make that less painful but it's still blocks any other http requests to the master. Perhaps there is something else buried in buildbot/status/web we can use instead.
(Reporter)

Comment 11

9 years ago
Could we do something with nagios inspecting the twistd.log file on the slave and complaining if it's too old?
Philippe, it is on my personal HG repo http://hg.johnford.info/buildslaves-scrape/

Nick, You'll notice that we have 4 environments in a similar configuration to pm/pm02.  The action needed list is across all of the 4 buildbot masters we have for mobile

Chris, that sounds like a good idea and a lot less master-intrusive than scraping buildslaves.  I would be worried that as the last heard from time was 0 seconds that it is still writing to the logs.  Is this the case?  Maybe we could look for something like "has the last buildstep on this slave started more than 8 hours ago?"
gah, i haven't been able to push the lastest copy of that script.  When i remember where the latest copy is I will push it to my HG server
(In reply to comment #12)
> Nick, You'll notice that we have 4 environments in a similar configuration to
> pm/pm02.  The action needed list is across all of the 4 buildbot masters we
> have for mobile

Yeah, looks great for the mobile masters. I just think we'll kill pm if we poll the buildslaves and buildsslaves/$slave a lot.

(In reply to comment #11)
> Could we do something with nagios inspecting the twistd.log file on the slave
> and complaining if it's too old?

I like this idea a bunch. Looking at some slaves that were last heard from 10 to 12 hours ago (linux on pm02 are good for this) there's no heartbeat of keepalive messages there though. Just re-attaching to the builders after the reboot, and a 
 2009/10/22 11:15 PDT [Broker,client] I have a leftover directory 'linux_repack' that is not being used by the buildmaster: you can delete it now
which I don't think we can rely on. So a nagios check for a twistd.log that hasn't changed in a .... day ??
Component: Release Engineering → Release Engineering: Future
(Reporter)

Comment 15

9 years ago
I think this needs to be done soon.
Component: Release Engineering: Future → Release Engineering
Moving to future as it isn't causing redness
Component: Release Engineering → Release Engineering: Future
Yes, this needs to be done soon. When this happens, it causes slaves to silently hang, and never accept new jobs. Over time, this means that we have fewer slaves actually processing jobs, which is bad for wait times. 

However, until we have someone who has time to work on this, it seems Future is accurate place for it.
(Reporter)

Updated

9 years ago
Assignee: nobody → catlee
Component: Release Engineering: Future → Release Engineering
(Reporter)

Comment 18

9 years ago
bm-xserve19 stuck doing a m-c debug everything else test run (crashtest) since the 23rd.
(Reporter)

Comment 19

9 years ago
here's the last traceback on the slave:

2009-10-23 10:25:16-0700 [-] Unexpected error in main loop.
2009-10-23 10:25:16-0700 [-] Traceback (most recent call last):
2009-10-23 10:25:16-0700 [-]   File "/tools/buildbot/bin/buildbot", line 4, in <module>
2009-10-23 10:25:16-0700 [-]     runner.run()
2009-10-23 10:25:16-0700 [-]   File "/tools/buildbot-5f43578cba2b//lib/python2.5/site-packages/buildbot/scripts/runner.py", line 1016, in run
2009-10-23 10:25:16-0700 [-]     start(so)
2009-10-23 10:25:16-0700 [-]   File "/tools/buildbot-5f43578cba2b//lib/python2.5/site-packages/buildbot/scripts/startup.py", line 90, in start
2009-10-23 10:25:16-0700 [-]     launch(config)
2009-10-23 10:25:16-0700 [-]   File "/tools/buildbot-5f43578cba2b//lib/python2.5/site-packages/buildbot/scripts/startup.py", line 127, in launch
2009-10-23 10:25:16-0700 [-]     run()
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/scripts/twistd.py", line 24, in run
2009-10-23 10:25:16-0700 [-]     app.run(runApp, ServerOptions)
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/application/app.py", line 599, in run
2009-10-23 10:25:16-0700 [-]     runApp(config)
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/scripts/twistd.py", line 20, in runApp
2009-10-23 10:25:16-0700 [-]     _SomeApplicationRunner(config).run()
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/application/app.py", line 333, in run
2009-10-23 10:25:16-0700 [-]     self.postApplication()
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/scripts/_twistd_unix.py", line 261, in postApplication
2009-10-23 10:25:16-0700 [-]     self.profiler)
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/application/app.py", line 275, in runReactorWithLogging
2009-10-23 10:25:16-0700 [-]     traceback.print_exc(file=file)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/traceback.py", line 227, in print_exc
2009-10-23 10:25:16-0700 [-]     print_exception(etype, value, tb, limit, file)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/traceback.py", line 125, in print_exception
2009-10-23 10:25:16-0700 [-]     print_tb(tb, limit, file)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/traceback.py", line 69, in print_tb
2009-10-23 10:25:16-0700 [-]     line = linecache.getline(filename, lineno, f.f_globals)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/linecache.py", line 14, in getline
2009-10-23 10:25:16-0700 [-]     lines = getlines(filename, module_globals)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/linecache.py", line 40, in getlines
2009-10-23 10:25:16-0700 [-]     return updatecache(filename, module_globals)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/linecache.py", line 129, in updatecache
2009-10-23 10:25:16-0700 [-]     lines = fp.readlines()
2009-10-23 10:25:16-0700 [-] MemoryError
(Reporter)

Comment 20

9 years ago
It would also be nice to get notified of exceptions that happen on slaves.
Summary: Detect hung slaves → Detect hung slaves, exceptions on slaves
(Reporter)

Comment 21

9 years ago
We can detect old versions of twistd.log on windows by adding this to nsc.ini in the [NRPE] section:

allow_nasty_meta_chars=1

And then running this check from the server:
./check_nrpe -H $HOST -c CheckFile -a file="E:\\builds\\slave\\twistd.log" filter-written=\>12h MaxCrit=1 syntax="%filename% last changed %write%"

We'll have to make sure we have consistent slave directory names though.
(Reporter)

Updated

9 years ago
Summary: Detect hung slaves, exceptions on slaves → Detect hung slaves
While cleaning up from bug 528670, found 
* moz2-darwin9-slave22 hung since Nov 12th
* moz2-linux-slave39,40,48 hung since Nov 12th
* moz2-linux-slave49 never completed connection on Nov 10th
All connected to pm02.
(Reporter)

Comment 23

9 years ago
Created attachment 426540 [details] [diff] [review]
add check_file_age check

when this lands, we can then add checks such as,
./check_nrpe -H $HOST -c check_file_age -a 7200 21600 /builds/slave/twistd.log

to the nagios server
Attachment #426540 - Flags: review?(bhearsum)
(Reporter)

Comment 24

9 years ago
Created attachment 426566 [details] [diff] [review]
add nagios configs on mac
Attachment #426566 - Flags: review?(bhearsum)
(Reporter)

Comment 25

9 years ago
Created attachment 426568 [details]
nrpe.cfg for mac
Attachment #426568 - Flags: review?(bhearsum)
(Reporter)

Comment 26

9 years ago
Created attachment 426569 [details] [diff] [review]
OPSI package for managing nsc.ini on windows

This adds the 'allow_nasty_meta_chars' option to the existing nsc.ini file.
Attachment #426569 - Flags: review?(bhearsum)
Attachment #426540 - Flags: review?(bhearsum) → review+
Attachment #426566 - Flags: review?(bhearsum) → review+
Attachment #426568 - Flags: review?(bhearsum) → review+
Comment on attachment 426569 [details] [diff] [review]
OPSI package for managing nsc.ini on windows

What here allows us to check file age?
(Reporter)

Comment 28

9 years ago
Comment on attachment 426540 [details] [diff] [review]
add check_file_age check

cvs commit centos5/etc/nagios/nrpe.cfg darwin9/nrpe.cfg
Checking in centos5/etc/nagios/nrpe.cfg;
/mofo/puppet-files/centos5/etc/nagios/nrpe.cfg,v  <--  nrpe.cfg
new revision: 1.2; previous revision: 1.1
done
RCS file: /mofo/puppet-files/darwin9/nrpe.cfg,v
done
Checking in darwin9/nrpe.cfg;
/mofo/puppet-files/darwin9/nrpe.cfg,v  <--  nrpe.cfg
initial revision: 1.1
done
Attachment #426540 - Flags: checked-in+
(Reporter)

Comment 29

9 years ago
Comment on attachment 426568 [details]
nrpe.cfg for mac

cvs commit centos5/etc/nagios/nrpe.cfg darwin9/nrpe.cfg
Checking in centos5/etc/nagios/nrpe.cfg;
/mofo/puppet-files/centos5/etc/nagios/nrpe.cfg,v  <--  nrpe.cfg
new revision: 1.2; previous revision: 1.1
done
RCS file: /mofo/puppet-files/darwin9/nrpe.cfg,v
done
Checking in darwin9/nrpe.cfg;
/mofo/puppet-files/darwin9/nrpe.cfg,v  <--  nrpe.cfg
initial revision: 1.1
done
Attachment #426568 - Flags: checked-in+
(Reporter)

Comment 30

9 years ago
(In reply to comment #27)
> (From update of attachment 426569 [details] [diff] [review])
> What here allows us to check file age?

We can add a command in the nagios server to do this:
check_nrpe -H $HOST -c CheckFile -a file="E:\\builds\\slave\\twistd.log"
filter-written=\>12h MaxCrit=1 syntax="%filename% last changed %write%"

CheckFile is already enabled via the CheckDisk module I believe.
(Reporter)

Updated

9 years ago
Assignee: catlee → nobody
catlee - you seem to be actively working on this, should it be assigned to you?
(Assignee)

Updated

9 years ago
Assignee: nobody → raliiev
(Reporter)

Comment 32

9 years ago
bhearsum: ping on the r?
Attachment #426569 - Flags: review?(bhearsum) → review+
(Assignee)

Comment 33

9 years ago
Created attachment 435610 [details] [diff] [review]
add nagios configs on mac (refreshed)

Attaching refreshed copy.

Looks like the corresponding file (nrpe.cfg) and symlink have been already deployed (by hand?), so it would be better if we deploy it via puppet.
Attachment #426566 - Attachment is obsolete: true
Attachment #435610 - Flags: review?(catlee)
Attachment #435610 - Flags: checked-in?
(Reporter)

Comment 34

9 years ago
Comment on attachment 435610 [details] [diff] [review]
add nagios configs on mac (refreshed)

${fileroot}darwin9/nrpe.cfg should probably be changed to ${fileroot}darwin-shared/nrpe.cfg.

Other than that, looks ok to me, but I know a lot has happened with osx ref platforms.
Attachment #435610 - Flags: review?(catlee)
Attachment #435610 - Flags: review?(bhearsum)
Attachment #435610 - Flags: review+
Comment on attachment 435610 [details] [diff] [review]
add nagios configs on mac (refreshed)

Yeah, please move this to darwin-shared. r=me otherwise
(Assignee)

Comment 36

9 years ago
(In reply to comment #35)
> (From update of attachment 435610 [details] [diff] [review])
> Yeah, please move this to darwin-shared. r=me otherwise

Could anybody move the corresponding file from darwin9 to darwin-shared? The only reason why I didn't changed the path to darwin-shared one was the file which is in darwin9.
(Assignee)

Comment 37

9 years ago
Created attachment 436135 [details] [diff] [review]
OPSI package for managing nsc.ini on windows + nscp restart

Could you review the following patch for nscp opsi package?

I've just added winbatch_nscp_restart to restart nscp service and reread the configuration file.
Attachment #436135 - Flags: review?(bhearsum)
(Assignee)

Comment 38

9 years ago
Created attachment 436159 [details] [diff] [review]
add nagios configs on mac (refreshed, use darwin-shared)
Attachment #435610 - Attachment is obsolete: true
Attachment #436159 - Flags: review?(bhearsum)
Attachment #435610 - Flags: review?(bhearsum)
Attachment #435610 - Flags: checked-in?

Comment 39

9 years ago
(In reply to comment #36)
> (In reply to comment #35)
> > (From update of attachment 435610 [details] [diff] [review] [details])
> > Yeah, please move this to darwin-shared. r=me otherwise
> 
> Could anybody move the corresponding file from darwin9 to darwin-shared? The
> only reason why I didn't changed the path to darwin-shared one was the file
> which is in darwin9.

Done.

[root@staging-puppet puppet-files]# cvs ci -m "Bug 523827. Move darwin9/nrpe.cfg to darwin-shared/nrpe.cfg."
RCS file: /mofo/puppet-files/darwin-shared/nrpe.cfg,v
done
Checking in darwin-shared/nrpe.cfg;
/mofo/puppet-files/darwin-shared/nrpe.cfg,v  <--  nrpe.cfg
initial revision: 1.1
done
Removing darwin9/nrpe.cfg;
/mofo/puppet-files/darwin9/nrpe.cfg,v  <--  nrpe.cfg
new revision: delete; previous revision: 1.1
done
Attachment #436159 - Flags: review?(bhearsum) → review+
Comment on attachment 436135 [details] [diff] [review]
OPSI package for managing nsc.ini on windows + nscp restart

This is incomplete; You're missing the entire OPSI/ directory.

Does %ProgramFilesDir% actually work? It doesn't from a cmd prompt. You can also just use 'net stop/start nsclientpp' to control the service.

Have you been able to test this in staging? If not, someone should show you how to.
Attachment #436135 - Flags: review?(bhearsum) → review-
(Assignee)

Comment 41

9 years ago
Created attachment 436654 [details] [diff] [review]
OPSI package for managing nsc.ini on windows + nscp restart

(In reply to comment #40)
> (From update of attachment 436135 [details] [diff] [review])
> This is incomplete; You're missing the entire OPSI/ directory.

Ooops, forgot to hg add it.

> Does %ProgramFilesDir% actually work? It doesn't from a cmd prompt. 

Do you mean DosBatch/DosInAction?
I guess, it should work for winbatch.

> You can
> also just use 'net stop/start nsclientpp' to control the service.

Good idea.

> Have you been able to test this in staging? If not, someone should show you how
> to.

Not tested at all. Attaching the refreshed patch to be tested in staging.
Attachment #436135 - Attachment is obsolete: true
(Assignee)

Updated

9 years ago
Priority: -- → P2
(Assignee)

Updated

9 years ago
Attachment #436159 - Flags: checked-in?
(Assignee)

Updated

9 years ago
Attachment #436654 - Attachment is obsolete: true
(Assignee)

Comment 42

9 years ago
Comment on attachment 426569 [details] [diff] [review]
OPSI package for managing nsc.ini on windows

I tested this package in staging. worked as expected.

Scenario:
1) create an OPSI package, install to opsi-staging.
2) deploy on win32-slave03
3) Compare old and new file (zero diff)
4) shut down buildbot and wait for Nagios CRITICAL alert
5) start buildbot and wait for Nagios OK alert
Attachment #426569 - Flags: checked-in?
Comment on attachment 426569 [details] [diff] [review]
OPSI package for managing nsc.ini on windows

changeset:   47:554e579160cc
Attachment #426569 - Flags: checked-in? → checked-in+
I set the OPSI package to roll out on all of the Windows build machines.
Comment on attachment 436159 [details] [diff] [review]
add nagios configs on mac (refreshed, use darwin-shared)

changeset:   129:149711c89aee
Attachment #436159 - Flags: checked-in? → checked-in+
The Mac machines should pick up the puppet file changes on next reboot.
Most of the Windows machines have rebooted and got these changes. I also updated the ref machines.
(Assignee)

Updated

9 years ago
Depends on: 564404
(Assignee)

Updated

9 years ago
Whiteboard: [buildslaves]
(Assignee)

Comment 48

8 years ago
Now we have these checks enabled and working. Closing.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.