Closed Bug 612288 Opened 14 years ago Closed 14 years ago

slave roundup

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jhford, Assigned: jhford)

References

Details

(Whiteboard: [buildduty][buildslaves])

There are a bunch of slaves that aren't talking to their masters.  We need to figure out why and fix it.

These slaves are not talking to their masters (according to missing-slaves.py) but are responding to pings on their build.mozilla.org address

Host linux-ix-slave08 is up on build.mozilla.org
Host linux-ix-slave16 is up on build.mozilla.org
Host linux-ix-slave17 is up on build.mozilla.org
Host linux-ix-slave31 is up on build.mozilla.org
Host linux-ix-slave32 is up on build.mozilla.org
Host linux-ix-slave34 is up on build.mozilla.org
Host linux-ix-slave35 is up on build.mozilla.org
Host moz2-darwin10-slave10 is up on build.mozilla.org
Host moz2-darwin10-slave23 is up on build.mozilla.org
Host moz2-darwin10-slave25 is up on build.mozilla.org
Host moz2-darwin9-slave10 is up on build.mozilla.org
Host moz2-linux-slave10 is up on build.mozilla.org
Host moz2-linux64-slave10 is up on build.mozilla.org
Host mv-moz2-linux-ix-slave17 is up on build.mozilla.org
Host mv-moz2-linux-ix-slave21 is up on build.mozilla.org
Host mw32-ix-slave12 is up on build.mozilla.org
Host t-r3-w764-003 is up on build.mozilla.org
Host t-r3-w764-004 is up on build.mozilla.org
Host t-r3-w764-005 is up on build.mozilla.org
Host t-r3-w764-006 is up on build.mozilla.org
Host t-r3-w764-007 is up on build.mozilla.org
Host t-r3-w764-008 is up on build.mozilla.org
Host t-r3-w764-009 is up on build.mozilla.org
Host t-r3-w764-011 is up on build.mozilla.org
Host t-r3-w764-012 is up on build.mozilla.org
Host t-r3-w764-013 is up on build.mozilla.org
Host t-r3-w764-014 is up on build.mozilla.org
Host t-r3-w764-015 is up on build.mozilla.org
Host t-r3-w764-016 is up on build.mozilla.org
Host t-r3-w764-017 is up on build.mozilla.org
Host t-r3-w764-019 is up on build.mozilla.org
Host t-r3-w764-021 is up on build.mozilla.org
Host t-r3-w764-022 is up on build.mozilla.org
Host t-r3-w764-024 is up on build.mozilla.org
Host t-r3-w764-025 is up on build.mozilla.org
Host t-r3-w764-026 is up on build.mozilla.org
Host t-r3-w764-029 is up on build.mozilla.org
Host t-r3-w764-030 is up on build.mozilla.org
Host t-r3-w764-031 is up on build.mozilla.org
Host t-r3-w764-032 is up on build.mozilla.org
Host t-r3-w764-033 is up on build.mozilla.org
Host t-r3-w764-034 is up on build.mozilla.org
Host t-r3-w764-035 is up on build.mozilla.org
Host t-r3-w764-036 is up on build.mozilla.org
Host t-r3-w764-037 is up on build.mozilla.org
Host t-r3-w764-043 is up on build.mozilla.org
Host t-r3-w764-044 is up on build.mozilla.org
Host t-r3-w764-045 is up on build.mozilla.org
Host t-r3-w764-046 is up on build.mozilla.org
Host t-r3-w764-047 is up on build.mozilla.org
Host t-r3-w764-048 is up on build.mozilla.org
Host t-r3-w764-049 is up on build.mozilla.org
Host t-r3-w764-050 is up on build.mozilla.org
Host talos-r3-fed-004 is up on build.mozilla.org
Host talos-r3-fed-014 is up on build.mozilla.org
Host talos-r3-fed-017 is up on build.mozilla.org
Host talos-r3-fed-023 is up on build.mozilla.org
Host talos-r3-fed-029 is up on build.mozilla.org
Host talos-r3-fed-030 is up on build.mozilla.org
Host talos-r3-fed-032 is up on build.mozilla.org
Host talos-r3-fed-036 is up on build.mozilla.org
Host talos-r3-fed-039 is up on build.mozilla.org
Host talos-r3-fed-043 is up on build.mozilla.org
Host talos-r3-fed-051 is up on build.mozilla.org
Host talos-r3-fed64-006 is up on build.mozilla.org
Host talos-r3-fed64-007 is up on build.mozilla.org
Host talos-r3-fed64-017 is up on build.mozilla.org
Host talos-r3-fed64-018 is up on build.mozilla.org
Host talos-r3-fed64-019 is up on build.mozilla.org
Host talos-r3-fed64-023 is up on build.mozilla.org
Host talos-r3-fed64-028 is up on build.mozilla.org
Host talos-r3-fed64-037 is up on build.mozilla.org
Host talos-r3-fed64-040 is up on build.mozilla.org
Host talos-r3-fed64-042 is up on build.mozilla.org
Host talos-r3-leopard-005 is up on build.mozilla.org
Host talos-r3-leopard-006 is up on build.mozilla.org
Host talos-r3-leopard-007 is up on build.mozilla.org
Host talos-r3-leopard-008 is up on build.mozilla.org
Host talos-r3-leopard-009 is up on build.mozilla.org
Host talos-r3-leopard-014 is up on build.mozilla.org
Host talos-r3-leopard-015 is up on build.mozilla.org
Host talos-r3-leopard-017 is up on build.mozilla.org
Host talos-r3-leopard-018 is up on build.mozilla.org
Host talos-r3-leopard-019 is up on build.mozilla.org
Host talos-r3-leopard-021 is up on build.mozilla.org
Host talos-r3-leopard-032 is up on build.mozilla.org
Host talos-r3-leopard-034 is up on build.mozilla.org
Host talos-r3-leopard-045 is up on build.mozilla.org
Host talos-r3-snow-011 is up on build.mozilla.org
Host talos-r3-snow-015 is up on build.mozilla.org
Host talos-r3-snow-017 is up on build.mozilla.org
Host talos-r3-snow-018 is up on build.mozilla.org
Host talos-r3-snow-019 is up on build.mozilla.org
Host talos-r3-snow-020 is up on build.mozilla.org
Host talos-r3-snow-021 is up on build.mozilla.org
Host talos-r3-snow-022 is up on build.mozilla.org
Host talos-r3-snow-025 is up on build.mozilla.org
Host talos-r3-snow-026 is up on build.mozilla.org
Host talos-r3-snow-032 is up on build.mozilla.org
Host talos-r3-snow-033 is up on build.mozilla.org
Host talos-r3-snow-038 is up on build.mozilla.org
Host talos-r3-snow-040 is up on build.mozilla.org
Host talos-r3-snow-041 is up on build.mozilla.org
Host talos-r3-snow-042 is up on build.mozilla.org
Host talos-r3-snow-049 is up on build.mozilla.org
Host talos-r3-snow-053 is up on build.mozilla.org
Host talos-r3-snow-055 is up on build.mozilla.org
Host talos-r3-w7-004 is up on build.mozilla.org
Host talos-r3-w7-006 is up on build.mozilla.org
Host talos-r3-w7-007 is up on build.mozilla.org
Host talos-r3-w7-008 is up on build.mozilla.org
Host talos-r3-w7-009 is up on build.mozilla.org
Host talos-r3-w7-012 is up on build.mozilla.org
Host talos-r3-w7-013 is up on build.mozilla.org
Host talos-r3-w7-017 is up on build.mozilla.org
Host talos-r3-w7-024 is up on build.mozilla.org
Host talos-r3-xp-014 is up on build.mozilla.org
Host talos-r3-xp-018 is up on build.mozilla.org
Host talos-r3-xp-047 is up on build.mozilla.org
Host try-mac-slave28 is up on build.mozilla.org
Host w32-ix-slave08 is up on build.mozilla.org
Host w32-ix-slave34 is up on build.mozilla.org
Host w32-ix-slave36 is up on build.mozilla.org
Host win32-slave10 is up on build.mozilla.org
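
A list this long is easier to triage when grouped by pool. The following is a minimal sketch, not part of missing-slaves.py: it assumes the report lines have exactly the "Host <name> is up on <domain>" shape shown above, and the function names (parse_up_hosts, pool_of, count_by_pool) are hypothetical helpers made up for illustration.

```python
import re
from collections import Counter

# Matches report lines of the form "Host <name> is up on <domain>".
HOST_LINE = re.compile(r"^Host (\S+) is up on (\S+)$")


def parse_up_hosts(report):
    """Return the host names from every matching line, in order."""
    hosts = []
    for line in report.splitlines():
        match = HOST_LINE.match(line.strip())
        if match:
            hosts.append(match.group(1))
    return hosts


def pool_of(host):
    """Strip the trailing slave number (e.g. "-slave08" or "-003")."""
    return re.sub(r"-?(slave)?\d+$", "", host)


def count_by_pool(hosts):
    """Count how many hosts fall into each pool name."""
    return Counter(pool_of(host) for host in hosts)
```

Feeding the list above through count_by_pool(parse_up_hosts(report)) would give one count per pool prefix, which makes it easier to see which pools are hit hardest.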
Assignee: nobody → jhford
Severity: normal → major
OS: Mac OS X → All
Priority: -- → P2
Hardware: x86 → All
Whiteboard: [buildduty][buildslaves]
This would explain why our wait times for test pool are degrading recently; we're seeing worse wait times, even with a smaller number of jobs! :-(
See Also: → 611846
Targeted hit list for fedora

talos-r3-fed-012
talos-r3-fed-017
talos-r3-fed-018
talos-r3-fed-019
talos-r3-fed-022
talos-r3-fed-024
talos-r3-fed-029
talos-r3-fed-035
talos-r3-fed-036
talos-r3-fed-039
talos-r3-fed-040
I tackled these:

talos-r3-leopard-003
talos-r3-leopard-005
talos-r3-leopard-006
talos-r3-leopard-007
talos-r3-leopard-008
talos-r3-leopard-009
talos-r3-leopard-012
talos-r3-leopard-013
talos-r3-leopard-014
talos-r3-leopard-015
talos-r3-leopard-016
talos-r3-leopard-017
talos-r3-leopard-018
talos-r3-leopard-019
talos-r3-leopard-031
talos-r3-leopard-032
talos-r3-leopard-034
talos-r3-leopard-038
talos-r3-leopard-040
talos-r3-leopard-042
talos-r3-leopard-045

For those that were actually moved to SCL I had to:
- verify that they were set to talk with scl-production-puppet.build.scl1.mozilla.com in:
> /Library/LaunchDaemons/com.reductivelabs.puppet.plist
- sudo rm -rf /etc/puppet/ssl/

For slaves talos-r3-leopard-0{14,17,19,32,34} (which are actually still on MV) I did the following:
- change them back to talk with mv-production-puppet.build.mozilla.org
- sudo rm -rf /etc/puppet/ssl/
- I also had to change their port back to 9012 as they were pointing to 9011
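
The per-location steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: it treats the plist as plain text and checks for the server name by substring, since the exact plist layout is not shown in this bug. The function names and the "scl"/"mv" location keys are hypothetical, and the 9011/9012 port fix is left out because the file it lives in is not named here.

```python
# Server names and paths are taken from the comment above.
PUPPET_SERVERS = {
    "scl": "scl-production-puppet.build.scl1.mozilla.com",
    "mv": "mv-production-puppet.build.mozilla.org",
}
PLIST_PATH = "/Library/LaunchDaemons/com.reductivelabs.puppet.plist"


def plist_points_at(plist_text, location):
    """True if the plist text mentions the puppet server for this location."""
    return PUPPET_SERVERS[location] in plist_text


def remediation_steps(plist_text, location):
    """List the manual steps a slave at this location would still need."""
    steps = []
    if not plist_points_at(plist_text, location):
        steps.append(
            "edit %s to use %s" % (PLIST_PATH, PUPPET_SERVERS[location])
        )
    # The stale SSL directory was removed in both cases.
    steps.append("sudo rm -rf /etc/puppet/ssl/")
    return steps
```

The function only returns the steps as strings rather than running anything, so the output can be reviewed before touching a slave.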
The following are talking with scl-production-puppet.build since they seem to be at scl (talos-r3-snow-025.build.scl1.mozilla.com has address 10.12.50.78)
talos-r3-snow-011
talos-r3-snow-017
talos-r3-snow-019
talos-r3-snow-021
talos-r3-snow-022
talos-r3-snow-025
talos-r3-snow-026
talos-r3-snow-031
talos-r3-snow-032
talos-r3-snow-033
talos-r3-snow-034
talos-r3-snow-038
talos-r3-snow-040
talos-r3-snow-042
talos-r3-snow-049
talos-r3-snow-053
talos-r3-snow-055

These are still at MV:
talos-r3-snow-015
talos-r3-snow-016
talos-r3-snow-018
I think they just need buildbot.tac.off to be moved but I would rather clean up any remaining slaves tomorrow.
(In reply to comment #2)
> Targeted hit list for fedora
> 
> talos-r3-fed-012

down, re-added to reboot bug

> talos-r3-fed-017

had to puppetca --clean, online and doing jobs

> talos-r3-fed-018

down, added to reboot bug

> talos-r3-fed-019

down, added to reboot bug

> talos-r3-fed-022

down, added to reboot bug

> talos-r3-fed-024

down, added to reboot bug

> talos-r3-fed-029

was turned off but didn't make it to Santa Clara.  Turned it back on, but it is in a weird state and not running in production

> talos-r3-fed-035
> talos-r3-fed-036
> talos-r3-fed-039
> talos-r3-fed-040
(In reply to comment #3)
> I tackled these:
> 
> talos-r3-leopard-003
> talos-r3-leopard-005
> talos-r3-leopard-006
> talos-r3-leopard-007
> talos-r3-leopard-008
> talos-r3-leopard-009
> talos-r3-leopard-012
> talos-r3-leopard-013
> talos-r3-leopard-014
> talos-r3-leopard-015
> talos-r3-leopard-016
> talos-r3-leopard-017
> talos-r3-leopard-018
> talos-r3-leopard-019
> talos-r3-leopard-031
> talos-r3-leopard-032
> talos-r3-leopard-034
> talos-r3-leopard-038
> talos-r3-leopard-040
> talos-r3-leopard-042
> talos-r3-leopard-045

These slaves weren't connecting to buildbot.  The logs suggested clearing the puppet certificate on the master.  I cleared the cert with puppetca for all of these leopard slaves and am rebooting them and verifying their connections.  So far I have verified that these are connected to their master:
> talos-r3-leopard-008
> talos-r3-leopard-009
Verified to be online and connected to buildbot:

> talos-r3-leopard-012
> talos-r3-leopard-014
> talos-r3-leopard-013 - had wrong port number
> talos-r3-leopard-015
> talos-r3-leopard-017
> talos-r3-leopard-018
> talos-r3-leopard-019
> talos-r3-leopard-031
> talos-r3-leopard-032
> talos-r3-leopard-038
> talos-r3-leopard-040
> talos-r3-leopard-042
> talos-r3-leopard-045

Down:
> talos-r3-leopard-016

Worked on and need to verify the following:
> talos-r3-leopard-003
> talos-r3-leopard-005
> talos-r3-leopard-006
> talos-r3-leopard-007
> talos-r3-leopard-034
Verified to be online and connected to buildbot:
> talos-r3-leopard-003
> talos-r3-leopard-005
> talos-r3-leopard-006

I haven't a clue what is wrong with:

> talos-r3-leopard-007
> talos-r3-leopard-034
Verified on buildbot:
> talos-r3-snow-022
> talos-r3-snow-042
> talos-r3-snow-049
> talos-r3-snow-033
> talos-r3-snow-034
> talos-r3-snow-040

Need to debug:
> talos-r3-snow-011
> talos-r3-snow-017
> talos-r3-snow-019
> talos-r3-snow-021
> talos-r3-snow-025
> talos-r3-snow-026
> talos-r3-snow-031
> talos-r3-snow-032
> talos-r3-snow-038
> talos-r3-snow-053
> talos-r3-snow-055
> talos-r3-snow-015 (mv)
> talos-r3-snow-016 (mv)
> talos-r3-snow-018 (mv)
Verified:
> talos-r3-snow-031
> talos-r3-leopard-007
> talos-r3-snow-019 puppet was pointing to build.m.o not build.scl1.m.c
 puppet was pointing to build.m.o not build.scl1.m.c
> talos-r3-snow-015 (mv)
> talos-r3-snow-016 (mv)
> talos-r3-snow-018 (mv)

Down:
> talos-r3-snow-011 offline since Oct 29, 2010
> talos-r3-snow-017 offline since Oct 16, 2010
> talos-r3-snow-021 offline since Oct 27, 2010
> talos-r3-snow-025 offline since Nov 02, 2010
> talos-r3-snow-026 offline since Oct 19, 2010
> talos-r3-snow-032 offline since Oct 21, 2010
> talos-r3-snow-038 offline since Oct 26, 2010
> talos-r3-snow-053 offline since Nov 04, 2010
> talos-r3-snow-055 offline since Nov 04, 2010
Comment 10 should read as below (there was an accidental paste in comment 10).

Verified:
> talos-r3-snow-031
> talos-r3-leopard-007
> talos-r3-snow-019 puppet was pointing to build.m.o not build.scl1.m.c
> talos-r3-snow-015 (mv)
> talos-r3-snow-016 (mv)
> talos-r3-snow-018 (mv)

Down:
> talos-r3-snow-011 offline since Oct 29, 2010
> talos-r3-snow-017 offline since Oct 16, 2010
> talos-r3-snow-021 offline since Oct 27, 2010
> talos-r3-snow-025 offline since Nov 02, 2010
> talos-r3-snow-026 offline since Oct 19, 2010
> talos-r3-snow-032 offline since Oct 21, 2010
> talos-r3-snow-038 offline since Oct 26, 2010
> talos-r3-snow-053 offline since Nov 04, 2010
> talos-r3-snow-055 offline since Nov 04, 2010
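
To see which of the down slaves has been offline longest, the "offline since" lines can be sorted by date. A minimal sketch, assuming lines of exactly the form shown above and English month abbreviations; parse_offline is a hypothetical helper name, not an existing tool.

```python
from datetime import datetime


def parse_offline(lines):
    """Return (host, date) pairs sorted from longest offline to most recent."""
    entries = []
    for line in lines:
        # Drop the leading "> " quote marker and surrounding whitespace.
        text = line.lstrip("> ").strip()
        host, sep, when = text.partition(" offline since ")
        if sep:
            # Dates look like "Oct 29, 2010".
            entries.append((host, datetime.strptime(when, "%b %d, %Y").date()))
    return sorted(entries, key=lambda entry: entry[1])
```

The first entry of the result is the slave with the oldest offline date, which is probably the best candidate to chase first.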
Verified:
> talos-r3-fed-029


Down:
> talos-r3-fed-012 not responding to pings
> talos-r3-fed-018 not responding to pings
> talos-r3-fed-019 not responding to pings
> talos-r3-fed-022 not responding to pings
> talos-r3-fed-024 not responding to pings
> talos-r3-fed-035 not responding to pings
> talos-r3-fed-036 os is broken
> talos-r3-fed-039 not syncing with puppet
> talos-r3-fed-040 not responding to pings
I'm not sure what the current state is here.  Would it be worth using a google spreadsheet for this sort of "roundup", to facilitate tracking the changing state and coordinating multiple engineers?  The final state could be attached to the ticket for historical posterity, if necessary.
(In reply to comment #8)
> Verified to be online and connected to buildbot:
> > talos-r3-leopard-003

This box is currently not running buildbot - (buildbot.tac.off).  Is there a reason?
(In reply to comment #14)
> I'm not sure what the current state is here.  Would it be worth using a google
> spreadsheet for this sort of "roundup", to facilitate tracking the changing
> state and coordinating multiple engineers?  The final state could be attached
> to the ticket for historical posterity, if necessary.

Good call. Bear, Rail, and I had been using an etherpad this week, but a spreadsheet probably makes more sense. I'll get that set up today and move our etherpad data into it.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering