Closed Bug 523827 Opened 15 years ago Closed 14 years ago

Detect hung slaves

Categories

(Release Engineering :: General, defect, P2)

x86
Linux

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: rail)

References

Details

(Whiteboard: [buildslaves] )

Attachments

(4 files, 4 obsolete files)

While trying to deploy a change to production-master this morning, I discovered at least 4 slaves were in a "stuck" state:

moz2-win32-slave18 had been gracefully shutting down since the 19th
bm-xserve16 had been running a debug test since the 12th
bm-xserve18 had been running a debug test since the 14th
bm-xserve19 had been running a debug test since the 6th
One possible impl: a nagios check that watches /buildslaves periodically and looks for anything above a certain "last heard from" threshold. AFAICT these slaves were all in the "days ago" last heard from area.
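
For illustration only, a minimal sketch of what that polling check could look like (Python 2 to match the slave environment; the master URL and the regex-based HTML parsing are assumptions, not an implementation from this bug):

#!/usr/bin/env python
# Rough sketch of the /buildslaves polling idea above. The master URL and
# the HTML parsing are placeholders; a real check would want something
# sturdier than scraping the waterfall HTML.
import re
import urllib2

MASTER = "http://localhost:8010"  # hypothetical master URL
DAYS_RE = re.compile(r"\b\d+\s+days?\b", re.IGNORECASE)

html = urllib2.urlopen(MASTER + "/buildslaves").read()
for row in html.split("<tr>"):
    # Flag any table row that mentions an age measured in days (the page
    # shows a "last heard from" value per slave).
    if DAYS_RE.search(row):
        text = re.sub(r"<[^>]+>", " ", row)
        print "possible hung slave:", " ".join(text.split())[:120]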
(In reply to comment #1)
> One possible impl: a nagios check that watches /buildslaves periodically and
> looks for anything above a certain "last heard from" threshold. AFAICT these
> slaves were all in the "days ago" last heard from area.

Nevermind. This doesn't seem true for all of them.
I have a scraper for this purpose; it runs against the mobile master's buildslaves page and detects when each slave was last heard from. I have the threshold for mobile set at 60 minutes. Because of the PM1-PM2 split, the most useful section of this list is going to be the top "ATTENTION NEEDED LIST" section. It writes its output to http://production-mobile-master.mv.mozilla.com/n810-status (Mozilla MV VPN required); some sample output is at http://pastebin.mozilla.org/678275
Yeah, unfortunately, not all of the hung slaves have a high "last heard from". This could help for some, though.
Is there an indicator on the buildslaves page that I could capture?
(In reply to comment #5)
> Is there an indicator on the buildslaves page that I could capture?

Not on /buildslaves. The way catlee discovered some of these was to look at /buildslaves/$slave and notice a build that had been running for days.
Hmm, that would be possible with this script but would definitely put a lot of load on the buildbot masters to generate the /buildslaves page and each /buildslaves/$slave page for connected machines.

Out of curiosity, what was the last heard from time on these?
(In reply to comment #7)
> Hmm, that would be possible with this script but would definitely put a lot
> of load on the buildbot masters to generate the /buildslaves page and each
> /buildslaves/$slave page for connected machines.

Yeah, I'm concerned about this too. I think we're ok only polling every 4 or 6 hours or so, though.

> Out of curiosity, what was the last heard from time on these?

0 seconds
(In reply to comment #3)
> I have a scraper for this purpose, it runs for the mobile master's buildslave
> page and detects when the slave was last heard from.

Any chance you can post that script somewhere? I'd love to use it as well.
(In reply to comment #7)
> Hmm, that would be possible with this script but would definitely put a lot
> of load on the buildbot masters to generate the /buildslaves page and each
> /buildslaves/$slave page for connected machines.

The way we have pm/pm02 configured, with every slave set up on both masters, makes this a no-go - it takes tens of seconds for the master to try to find previous builds for machines that were never connected (or got moved more than a few days ago). You can put ?builds=0 on the slave URL (which means current build plus the last one) to make that less painful, but it still blocks any other HTTP requests to the master. Perhaps there is something else buried in buildbot/status/web we can use instead.
Could we do something with nagios inspecting the twistd.log file on the slave and complaining if it's too old?
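
As a sketch of what that kind of file-age check could look like (illustrative only; the checks that eventually landed use the stock check_file_age and CheckFile commands instead, and the path and thresholds here are placeholders):

#!/usr/bin/env python
# Minimal Nagios-plugin-style check: complain when twistd.log has not been
# written to for too long. Exits with standard Nagios return codes.
import os
import sys
import time

LOG = "/builds/slave/twistd.log"   # slave log path (adjust per platform)
WARN = 2 * 3600                    # warn if untouched for 2 hours
CRIT = 6 * 3600                    # critical if untouched for 6 hours

def main():
    try:
        age = time.time() - os.stat(LOG).st_mtime
    except OSError, e:
        print "CRITICAL: cannot stat %s: %s" % (LOG, e)
        return 2
    if age > CRIT:
        print "CRITICAL: %s not written in %d seconds" % (LOG, age)
        return 2
    if age > WARN:
        print "WARNING: %s not written in %d seconds" % (LOG, age)
        return 1
    print "OK: %s written %d seconds ago" % (LOG, age)
    return 0

if __name__ == "__main__":
    sys.exit(main())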
Philippe, it is on my personal HG repo http://hg.johnford.info/buildslaves-scrape/

Nick, You'll notice that we have 4 environments in a similar configuration to pm/pm02.  The action needed list is across all of the 4 buildbot masters we have for mobile

Chris, that sounds like a good idea and a lot less master-intrusive than scraping buildslaves. My worry is that, since the last heard from time was 0 seconds, the slave is still writing to its logs. Is that the case? Maybe we could look for something like "has the last buildstep on this slave started more than 8 hours ago?"
Gah, I haven't been able to push the latest copy of that script. When I remember where the latest copy is, I will push it to my HG server.
(In reply to comment #12)
> Nick, You'll notice that we have 4 environments in a similar configuration to
> pm/pm02.  The action needed list is across all of the 4 buildbot masters we
> have for mobile

Yeah, looks great for the mobile masters. I just think we'll kill pm if we poll /buildslaves and /buildslaves/$slave a lot.

(In reply to comment #11)
> Could we do something with nagios inspecting the twistd.log file on the slave
> and complaining if it's too old?

I like this idea a bunch. Looking at some slaves that were last heard from 10 to 12 hours ago (linux on pm02 is good for this), there's no heartbeat of keepalive messages in the log, though. Just the re-attaching to the builders after the reboot, and a
 2009/10/22 11:15 PDT [Broker,client] I have a leftover directory 'linux_repack' that is not being used by the buildmaster: you can delete it now
which I don't think we can rely on. So: a nagios check for a twistd.log that hasn't changed in, say, a day?
Component: Release Engineering → Release Engineering: Future
I think this needs to be done soon.
Component: Release Engineering: Future → Release Engineering
Moving to future as it isn't causing redness
Component: Release Engineering → Release Engineering: Future
Yes, this needs to be done soon. When this happens, it causes slaves to silently hang, and never accept new jobs. Over time, this means that we have fewer slaves actually processing jobs, which is bad for wait times. 

However, until we have someone who has time to work on this, it seems Future is an accurate place for it.
Assignee: nobody → catlee
Component: Release Engineering: Future → Release Engineering
bm-xserve19 has been stuck doing an m-c debug everything-else test run (crashtest) since the 23rd.
here's the last traceback on the slave:

2009-10-23 10:25:16-0700 [-] Unexpected error in main loop.
2009-10-23 10:25:16-0700 [-] Traceback (most recent call last):
2009-10-23 10:25:16-0700 [-]   File "/tools/buildbot/bin/buildbot", line 4, in <module>
2009-10-23 10:25:16-0700 [-]     runner.run()
2009-10-23 10:25:16-0700 [-]   File "/tools/buildbot-5f43578cba2b//lib/python2.5/site-packages/buildbot/scripts/runner.py", line 1016, in run
2009-10-23 10:25:16-0700 [-]     start(so)
2009-10-23 10:25:16-0700 [-]   File "/tools/buildbot-5f43578cba2b//lib/python2.5/site-packages/buildbot/scripts/startup.py", line 90, in start
2009-10-23 10:25:16-0700 [-]     launch(config)
2009-10-23 10:25:16-0700 [-]   File "/tools/buildbot-5f43578cba2b//lib/python2.5/site-packages/buildbot/scripts/startup.py", line 127, in launch
2009-10-23 10:25:16-0700 [-]     run()
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/scripts/twistd.py", line 24, in run
2009-10-23 10:25:16-0700 [-]     app.run(runApp, ServerOptions)
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/application/app.py", line 599, in run
2009-10-23 10:25:16-0700 [-]     runApp(config)
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/scripts/twistd.py", line 20, in runApp
2009-10-23 10:25:16-0700 [-]     _SomeApplicationRunner(config).run()
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/application/app.py", line 333, in run
2009-10-23 10:25:16-0700 [-]     self.postApplication()
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/scripts/_twistd_unix.py", line 261, in postApplication
2009-10-23 10:25:16-0700 [-]     self.profiler)
2009-10-23 10:25:16-0700 [-]   File "/tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/application/app.py", line 275, in runReactorWithLogging
2009-10-23 10:25:16-0700 [-]     traceback.print_exc(file=file)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/traceback.py", line 227, in print_exc
2009-10-23 10:25:16-0700 [-]     print_exception(etype, value, tb, limit, file)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/traceback.py", line 125, in print_exception
2009-10-23 10:25:16-0700 [-]     print_tb(tb, limit, file)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/traceback.py", line 69, in print_tb
2009-10-23 10:25:16-0700 [-]     line = linecache.getline(filename, lineno, f.f_globals)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/linecache.py", line 14, in getline
2009-10-23 10:25:16-0700 [-]     lines = getlines(filename, module_globals)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/linecache.py", line 40, in getlines
2009-10-23 10:25:16-0700 [-]     return updatecache(filename, module_globals)
2009-10-23 10:25:16-0700 [-]   File "/tools/python-2.5.2/lib/python2.5/linecache.py", line 129, in updatecache
2009-10-23 10:25:16-0700 [-]     lines = fp.readlines()
2009-10-23 10:25:16-0700 [-] MemoryError
It would also be nice to get notified of exceptions that happen on slaves.
Summary: Detect hung slaves → Detect hung slaves, exceptions on slaves
We can detect a stale twistd.log on Windows by adding this to the [NRPE] section of nsc.ini:

allow_nasty_meta_chars=1

And then running this check from the server:
./check_nrpe -H $HOST -c CheckFile -a file="E:\\builds\\slave\\twistd.log" filter-written=\>12h MaxCrit=1 syntax="%filename% last changed %write%"

We'll have to make sure we have consistent slave directory names though.
Summary: Detect hung slaves, exceptions on slaves → Detect hung slaves
While cleaning up from bug 528670, I found:
* moz2-darwin9-slave22 hung since Nov 12th
* moz2-linux-slave39,40,48 hung since Nov 12th
* moz2-linux-slave49 never completed connection on Nov 10th
All connected to pm02.
When this lands, we can then add checks such as:
./check_nrpe -H $HOST -c check_file_age -a 7200 21600 /builds/slave/twistd.log

to the nagios server.
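
For that check_nrpe call to work, the slave-side nrpe.cfg needs a matching command definition. A sketch of the kind of entry involved (the plugin path and the argument-passing setup are assumptions here, not the contents of the attached patch):

# nrpe.cfg on the slave (sketch only; passing -a arguments over NRPE also
# requires the daemon to be built with command-argument support)
dont_blame_nrpe=1
command[check_file_age]=/usr/lib/nagios/plugins/check_file_age -w $ARG1$ -c $ARG2$ -f $ARG3$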
Attachment #426540 - Flags: review?(bhearsum)
Attached patch add nagios configs on mac (obsolete) — Splinter Review
Attachment #426566 - Flags: review?(bhearsum)
Attached file nrpe.cfg for mac
Attachment #426568 - Flags: review?(bhearsum)
This adds the 'allow_nasty_meta_chars' option to the existing nsc.ini file.
Attachment #426569 - Flags: review?(bhearsum)
Attachment #426540 - Flags: review?(bhearsum) → review+
Attachment #426566 - Flags: review?(bhearsum) → review+
Attachment #426568 - Flags: review?(bhearsum) → review+
Comment on attachment 426569 [details] [diff] [review]
OPSI package for managing nsc.ini on windows

What here allows us to check file age?
Comment on attachment 426540 [details] [diff] [review]
add check_file_age check

cvs commit centos5/etc/nagios/nrpe.cfg darwin9/nrpe.cfg
Checking in centos5/etc/nagios/nrpe.cfg;
/mofo/puppet-files/centos5/etc/nagios/nrpe.cfg,v  <--  nrpe.cfg
new revision: 1.2; previous revision: 1.1
done
RCS file: /mofo/puppet-files/darwin9/nrpe.cfg,v
done
Checking in darwin9/nrpe.cfg;
/mofo/puppet-files/darwin9/nrpe.cfg,v  <--  nrpe.cfg
initial revision: 1.1
done
Attachment #426540 - Flags: checked-in+
Comment on attachment 426568 [details]
nrpe.cfg for mac

cvs commit centos5/etc/nagios/nrpe.cfg darwin9/nrpe.cfg
Checking in centos5/etc/nagios/nrpe.cfg;
/mofo/puppet-files/centos5/etc/nagios/nrpe.cfg,v  <--  nrpe.cfg
new revision: 1.2; previous revision: 1.1
done
RCS file: /mofo/puppet-files/darwin9/nrpe.cfg,v
done
Checking in darwin9/nrpe.cfg;
/mofo/puppet-files/darwin9/nrpe.cfg,v  <--  nrpe.cfg
initial revision: 1.1
done
Attachment #426568 - Flags: checked-in+
(In reply to comment #27)
> (From update of attachment 426569 [details] [diff] [review])
> What here allows us to check file age?

We can add a command in the nagios server to do this:
check_nrpe -H $HOST -c CheckFile -a file="E:\\builds\\slave\\twistd.log"
filter-written=\>12h MaxCrit=1 syntax="%filename% last changed %write%"

CheckFile is already enabled via the CheckDisk module, I believe.
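
A sketch of what that could look like on the nagios server (object names and the hostgroup are illustrative assumptions; the check_nrpe arguments are the ones shown above):

# nagios server config (sketch only)
define command{
    command_name    check_win_twistd_log
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c CheckFile -a file="E:\\builds\\slave\\twistd.log" filter-written=\>12h MaxCrit=1 syntax="%filename% last changed %write%"
    }

define service{
    use                     generic-service
    hostgroup_name          windows-build-slaves
    service_description     twistd.log age
    check_command           check_win_twistd_log
    }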
Assignee: catlee → nobody
catlee - you seem to be actively working on this, should it be assigned to you?
Assignee: nobody → raliiev
bhearsum: ping on the r?
Attachment #426569 - Flags: review?(bhearsum) → review+
Attaching a refreshed copy.

Looks like the corresponding file (nrpe.cfg) and symlink have already been deployed (by hand?), so it would be better to deploy it via puppet.
Attachment #426566 - Attachment is obsolete: true
Attachment #435610 - Flags: review?(catlee)
Attachment #435610 - Flags: checked-in?
Comment on attachment 435610 [details] [diff] [review]
add nagios configs on mac (refreshed)

${fileroot}darwin9/nrpe.cfg should probably be changed to ${fileroot}darwin-shared/nrpe.cfg.

Other than that, looks ok to me, but I know a lot has happened with osx ref platforms.
Attachment #435610 - Flags: review?(catlee)
Attachment #435610 - Flags: review?(bhearsum)
Attachment #435610 - Flags: review+
Comment on attachment 435610 [details] [diff] [review]
add nagios configs on mac (refreshed)

Yeah, please move this to darwin-shared. r=me otherwise
(In reply to comment #35)
> (From update of attachment 435610 [details] [diff] [review])
> Yeah, please move this to darwin-shared. r=me otherwise

Could anybody move the corresponding file from darwin9 to darwin-shared? The only reason I didn't change the path to the darwin-shared one is that the file is still in darwin9.
Could you review the following patch for the nscp OPSI package?

I've just added winbatch_nscp_restart to restart the nscp service and reread the configuration file.
Attachment #436135 - Flags: review?(bhearsum)
Attachment #435610 - Attachment is obsolete: true
Attachment #436159 - Flags: review?(bhearsum)
Attachment #435610 - Flags: review?(bhearsum)
Attachment #435610 - Flags: checked-in?
(In reply to comment #36)
> (In reply to comment #35)
> > (From update of attachment 435610 [details] [diff] [review] [details])
> > Yeah, please move this to darwin-shared. r=me otherwise
> 
> Could anybody move the corresponding file from darwin9 to darwin-shared? The
> only reason I didn't change the path to the darwin-shared one is that the
> file is still in darwin9.

Done.

[root@staging-puppet puppet-files]# cvs ci -m "Bug 523827. Move darwin9/nrpe.cfg to darwin-shared/nrpe.cfg."
RCS file: /mofo/puppet-files/darwin-shared/nrpe.cfg,v
done
Checking in darwin-shared/nrpe.cfg;
/mofo/puppet-files/darwin-shared/nrpe.cfg,v  <--  nrpe.cfg
initial revision: 1.1
done
Removing darwin9/nrpe.cfg;
/mofo/puppet-files/darwin9/nrpe.cfg,v  <--  nrpe.cfg
new revision: delete; previous revision: 1.1
done
Attachment #436159 - Flags: review?(bhearsum) → review+
Comment on attachment 436135 [details] [diff] [review]
OPSI package for managing nsc.ini on windows + nscp restart

This is incomplete; You're missing the entire OPSI/ directory.

Does %ProgramFilesDir% actually work? It doesn't from a cmd prompt. You can also just use 'net stop/start nsclientpp' to control the service.

Have you been able to test this in staging? If not, someone should show you how to.
Attachment #436135 - Flags: review?(bhearsum) → review-
(In reply to comment #40)
> (From update of attachment 436135 [details] [diff] [review])
> This is incomplete; You're missing the entire OPSI/ directory.

Oops, I forgot to hg add it.

> Does %ProgramFilesDir% actually work? It doesn't from a cmd prompt. 

Do you mean DosBatch/DosInAction?
I guess it should work for winbatch.

> You can
> also just use 'net stop/start nsclientpp' to control the service.

Good idea.

> Have you been able to test this in staging? If not, someone should show you how
> to.

Not tested at all. Attaching the refreshed patch to be tested in staging.
Attachment #436135 - Attachment is obsolete: true
Priority: -- → P2
Attachment #436159 - Flags: checked-in?
Attachment #436654 - Attachment is obsolete: true
Comment on attachment 426569 [details] [diff] [review]
OPSI package for managing nsc.ini on windows

I tested this package in staging. It worked as expected.

Scenario:
1) create an OPSI package, install to opsi-staging.
2) deploy on win32-slave03
3) Compare old and new file (zero diff)
4) shut down buildbot and wait for Nagios CRITICAL alert
5) start buildbot and wait for Nagios OK alert
Attachment #426569 - Flags: checked-in?
Comment on attachment 426569 [details] [diff] [review]
OPSI package for managing nsc.ini on windows

changeset:   47:554e579160cc
Attachment #426569 - Flags: checked-in? → checked-in+
I set the OPSI package to roll out on all of the Windows build machines.
Comment on attachment 436159 [details] [diff] [review]
add nagios configs on mac (refreshed, use darwin-shared)

changeset:   129:149711c89aee
Attachment #436159 - Flags: checked-in? → checked-in+
The Mac machines should pick up the puppet file changes on next reboot.
Most of the Windows machines have rebooted and got these changes. I also updated the ref machines.
Depends on: 564404
Whiteboard: [buildslaves]
Now we have these checks enabled and working. Closing.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering