Closed Bug 612288 Opened 14 years ago Closed 14 years ago

slave roundup

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jhford, Assigned: jhford)

References

Details

(Whiteboard: [buildduty][buildslaves])

There are a bunch of slaves that aren't talking to their masters. We need to figure out why and fix it.

These slaves are not talking to their masters (according to missing-slaves.py) but are responding to pings on their build.mozilla.org address:

Host linux-ix-slave08 is up on build.mozilla.org
Host linux-ix-slave16 is up on build.mozilla.org
Host linux-ix-slave17 is up on build.mozilla.org
Host linux-ix-slave31 is up on build.mozilla.org
Host linux-ix-slave32 is up on build.mozilla.org
Host linux-ix-slave34 is up on build.mozilla.org
Host linux-ix-slave35 is up on build.mozilla.org
Host moz2-darwin10-slave10 is up on build.mozilla.org
Host moz2-darwin10-slave23 is up on build.mozilla.org
Host moz2-darwin10-slave25 is up on build.mozilla.org
Host moz2-darwin9-slave10 is up on build.mozilla.org
Host moz2-linux-slave10 is up on build.mozilla.org
Host moz2-linux64-slave10 is up on build.mozilla.org
Host mv-moz2-linux-ix-slave17 is up on build.mozilla.org
Host mv-moz2-linux-ix-slave21 is up on build.mozilla.org
Host mw32-ix-slave12 is up on build.mozilla.org
Host t-r3-w764-003 is up on build.mozilla.org
Host t-r3-w764-004 is up on build.mozilla.org
Host t-r3-w764-005 is up on build.mozilla.org
Host t-r3-w764-006 is up on build.mozilla.org
Host t-r3-w764-007 is up on build.mozilla.org
Host t-r3-w764-008 is up on build.mozilla.org
Host t-r3-w764-009 is up on build.mozilla.org
Host t-r3-w764-011 is up on build.mozilla.org
Host t-r3-w764-012 is up on build.mozilla.org
Host t-r3-w764-013 is up on build.mozilla.org
Host t-r3-w764-014 is up on build.mozilla.org
Host t-r3-w764-015 is up on build.mozilla.org
Host t-r3-w764-016 is up on build.mozilla.org
Host t-r3-w764-017 is up on build.mozilla.org
Host t-r3-w764-019 is up on build.mozilla.org
Host t-r3-w764-021 is up on build.mozilla.org
Host t-r3-w764-022 is up on build.mozilla.org
Host t-r3-w764-024 is up on build.mozilla.org
Host t-r3-w764-025 is up on build.mozilla.org
Host t-r3-w764-026 is up on build.mozilla.org
Host t-r3-w764-029 is up on build.mozilla.org
Host t-r3-w764-030 is up on build.mozilla.org
Host t-r3-w764-031 is up on build.mozilla.org
Host t-r3-w764-032 is up on build.mozilla.org
Host t-r3-w764-033 is up on build.mozilla.org
Host t-r3-w764-034 is up on build.mozilla.org
Host t-r3-w764-035 is up on build.mozilla.org
Host t-r3-w764-036 is up on build.mozilla.org
Host t-r3-w764-037 is up on build.mozilla.org
Host t-r3-w764-043 is up on build.mozilla.org
Host t-r3-w764-044 is up on build.mozilla.org
Host t-r3-w764-045 is up on build.mozilla.org
Host t-r3-w764-046 is up on build.mozilla.org
Host t-r3-w764-047 is up on build.mozilla.org
Host t-r3-w764-048 is up on build.mozilla.org
Host t-r3-w764-049 is up on build.mozilla.org
Host t-r3-w764-050 is up on build.mozilla.org
Host talos-r3-fed-004 is up on build.mozilla.org
Host talos-r3-fed-014 is up on build.mozilla.org
Host talos-r3-fed-017 is up on build.mozilla.org
Host talos-r3-fed-023 is up on build.mozilla.org
Host talos-r3-fed-029 is up on build.mozilla.org
Host talos-r3-fed-030 is up on build.mozilla.org
Host talos-r3-fed-032 is up on build.mozilla.org
Host talos-r3-fed-036 is up on build.mozilla.org
Host talos-r3-fed-039 is up on build.mozilla.org
Host talos-r3-fed-043 is up on build.mozilla.org
Host talos-r3-fed-051 is up on build.mozilla.org
Host talos-r3-fed64-006 is up on build.mozilla.org
Host talos-r3-fed64-007 is up on build.mozilla.org
Host talos-r3-fed64-017 is up on build.mozilla.org
Host talos-r3-fed64-018 is up on build.mozilla.org
Host talos-r3-fed64-019 is up on build.mozilla.org
Host talos-r3-fed64-023 is up on build.mozilla.org
Host talos-r3-fed64-028 is up on build.mozilla.org
Host talos-r3-fed64-037 is up on build.mozilla.org
Host talos-r3-fed64-040 is up on build.mozilla.org
Host talos-r3-fed64-042 is up on build.mozilla.org
Host talos-r3-leopard-005 is up on build.mozilla.org
Host talos-r3-leopard-006 is up on build.mozilla.org
Host talos-r3-leopard-007 is up on build.mozilla.org
Host talos-r3-leopard-008 is up on build.mozilla.org
Host talos-r3-leopard-009 is up on build.mozilla.org
Host talos-r3-leopard-014 is up on build.mozilla.org
Host talos-r3-leopard-015 is up on build.mozilla.org
Host talos-r3-leopard-017 is up on build.mozilla.org
Host talos-r3-leopard-018 is up on build.mozilla.org
Host talos-r3-leopard-019 is up on build.mozilla.org
Host talos-r3-leopard-021 is up on build.mozilla.org
Host talos-r3-leopard-032 is up on build.mozilla.org
Host talos-r3-leopard-034 is up on build.mozilla.org
Host talos-r3-leopard-045 is up on build.mozilla.org
Host talos-r3-snow-011 is up on build.mozilla.org
Host talos-r3-snow-015 is up on build.mozilla.org
Host talos-r3-snow-017 is up on build.mozilla.org
Host talos-r3-snow-018 is up on build.mozilla.org
Host talos-r3-snow-019 is up on build.mozilla.org
Host talos-r3-snow-020 is up on build.mozilla.org
Host talos-r3-snow-021 is up on build.mozilla.org
Host talos-r3-snow-022 is up on build.mozilla.org
Host talos-r3-snow-025 is up on build.mozilla.org
Host talos-r3-snow-026 is up on build.mozilla.org
Host talos-r3-snow-032 is up on build.mozilla.org
Host talos-r3-snow-033 is up on build.mozilla.org
Host talos-r3-snow-038 is up on build.mozilla.org
Host talos-r3-snow-040 is up on build.mozilla.org
Host talos-r3-snow-041 is up on build.mozilla.org
Host talos-r3-snow-042 is up on build.mozilla.org
Host talos-r3-snow-049 is up on build.mozilla.org
Host talos-r3-snow-053 is up on build.mozilla.org
Host talos-r3-snow-055 is up on build.mozilla.org
Host talos-r3-w7-004 is up on build.mozilla.org
Host talos-r3-w7-006 is up on build.mozilla.org
Host talos-r3-w7-007 is up on build.mozilla.org
Host talos-r3-w7-008 is up on build.mozilla.org
Host talos-r3-w7-009 is up on build.mozilla.org
Host talos-r3-w7-012 is up on build.mozilla.org
Host talos-r3-w7-013 is up on build.mozilla.org
Host talos-r3-w7-017 is up on build.mozilla.org
Host talos-r3-w7-024 is up on build.mozilla.org
Host talos-r3-xp-014 is up on build.mozilla.org
Host talos-r3-xp-018 is up on build.mozilla.org
Host talos-r3-xp-047 is up on build.mozilla.org
Host try-mac-slave28 is up on build.mozilla.org
Host w32-ix-slave08 is up on build.mozilla.org
Host w32-ix-slave34 is up on build.mozilla.org
Host w32-ix-slave36 is up on build.mozilla.org
Host win32-slave10 is up on build.mozilla.org

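A report in this shape is regular enough to summarize mechanically. The following is a minimal sketch, assuming the report is available as plain text in the "Host <name> is up on <domain>" form shown above. The function names and the pool-grouping rule (strip the trailing slave number) are illustrative assumptions and are not taken from missing-slaves.py.

```python
import re
from collections import defaultdict

# Matches report entries such as "Host talos-r3-fed-004 is up on build.mozilla.org".
HOST_LINE = re.compile(r"Host\s+(\S+)\s+is up on\s+(\S+)")


def parse_up_hosts(report):
    """Return the host names found in a ping report, in order of appearance."""
    return [match.group(1) for match in HOST_LINE.finditer(report)]


def group_by_pool(hosts):
    """Group hosts by pool, taken here to be the name minus its trailing number.

    Assumption: names end in "-NNN" or "-slaveNN", as in the list above.
    """
    pools = defaultdict(list)
    for host in hosts:
        pool = re.sub(r"-?(?:slave)?\d+$", "", host)
        pools[pool].append(host)
    return dict(pools)
```

Counting the entries per pool (for example `len(hosts)` for each key returned by `group_by_pool`) gives a quick view of which test pools have lost the most capacity.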
Assignee: nobody → jhford
Severity: normal → major
OS: Mac OS X → All
Priority: -- → P2
Hardware: x86 → All
Whiteboard: [buildduty][buildslaves]
This would explain why our wait times for test pool are degrading recently; we're seeing worse wait times, even with a smaller number of jobs! :-(
See Also: → 611846
Targeted hit list for fedora:
talos-r3-fed-012
talos-r3-fed-017
talos-r3-fed-018
talos-r3-fed-019
talos-r3-fed-022
talos-r3-fed-024
talos-r3-fed-029
talos-r3-fed-035
talos-r3-fed-036
talos-r3-fed-039
talos-r3-fed-040

I tackled these:
talos-r3-leopard-003
talos-r3-leopard-005
talos-r3-leopard-006
talos-r3-leopard-007
talos-r3-leopard-008
talos-r3-leopard-009
talos-r3-leopard-012
talos-r3-leopard-013
talos-r3-leopard-014
talos-r3-leopard-015
talos-r3-leopard-016
talos-r3-leopard-017
talos-r3-leopard-018
talos-r3-leopard-019
talos-r3-leopard-031
talos-r3-leopard-032
talos-r3-leopard-034
talos-r3-leopard-038
talos-r3-leopard-040
talos-r3-leopard-042
talos-r3-leopard-045

For those that were actually moved to SCL I had to:
- verify that they were set to talk with scl-production-puppet.build.scl1.mozilla.com in:
> /Library/LaunchDaemons/com.reductivelabs.puppet.plist
- sudo rm -rf /etc/puppet/ssl/

For slaves talos-r3-leopard-0{14,17,19,32,34} (which are actually still on MV) I did the following:
- change them back to talk with mv-production-puppet.build.mozilla.org
- sudo rm -rf /etc/puppet/ssl/
- I also had to change their port back to 9012 as they were pointing to 9011

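The "is this slave pointed at the right puppet master" check in the steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the two server names come from the comment, but the site keys ("scl", "mv"), the function names, and the idea of checking the plist by plain substring match are hypothetical and only for illustration.

```python
# Puppet masters named in the comment above, keyed by a hypothetical site label.
PUPPET_SERVERS = {
    "scl": "scl-production-puppet.build.scl1.mozilla.com",
    "mv": "mv-production-puppet.build.mozilla.org",
}


def expected_puppet_server(site):
    """Return the puppet master a slave at the given site should talk to."""
    return PUPPET_SERVERS[site]


def plist_points_at(plist_text, site):
    """True if the plist text names the expected master and no other known one.

    Assumption: the server name appears literally somewhere in the plist.
    """
    expected = expected_puppet_server(site)
    others = [name for name in PUPPET_SERVERS.values() if name != expected]
    return expected in plist_text and not any(name in plist_text for name in others)
```

On a slave, `plist_text` would be the contents of /Library/LaunchDaemons/com.reductivelabs.puppet.plist read as text. A False result corresponds to the case described above, where a machine still at MV had been pointed at the SCL master or the reverse.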
The following are talking with scl-production-puppet.build since they seem to be at scl (talos-r3-snow-025.build.scl1.mozilla.com has address 10.12.50.78):
talos-r3-snow-011
talos-r3-snow-017
talos-r3-snow-019
talos-r3-snow-021
talos-r3-snow-022
talos-r3-snow-025
talos-r3-snow-026
talos-r3-snow-031
talos-r3-snow-032
talos-r3-snow-033
talos-r3-snow-034
talos-r3-snow-038
talos-r3-snow-040
talos-r3-snow-042
talos-r3-snow-049
talos-r3-snow-053
talos-r3-snow-055

These are still at MV:
talos-r3-snow-015
talos-r3-snow-016
talos-r3-snow-018

I think they just need buildbot.tac.off to be moved but I would rather clean up any remaining slaves tomorrow.

(In reply to comment #2)
> Targeted hit list for fedora
>
> talos-r3-fed-012
down, re-added to reboot bug
> talos-r3-fed-017
had to puppetca --clean, online and doing jobs
> talos-r3-fed-018
down, added to reboot bug
> talos-r3-fed-019
down, added to reboot bug
> talos-r3-fed-022
down, added to reboot bug
> talos-r3-fed-024
down, added to reboot bug
> talos-r3-fed-029
was turned off but didn't make it to santa clara. turned it back on but it is in a weird state, not running in production
> talos-r3-fed-035
> talos-r3-fed-036
> talos-r3-fed-039
> talos-r3-fed-040

(In reply to comment #3)
> I tackled these:
>
> talos-r3-leopard-003
> talos-r3-leopard-005
> talos-r3-leopard-006
> talos-r3-leopard-007
> talos-r3-leopard-008
> talos-r3-leopard-009
> talos-r3-leopard-012
> talos-r3-leopard-013
> talos-r3-leopard-014
> talos-r3-leopard-015
> talos-r3-leopard-016
> talos-r3-leopard-017
> talos-r3-leopard-018
> talos-r3-leopard-019
> talos-r3-leopard-031
> talos-r3-leopard-032
> talos-r3-leopard-034
> talos-r3-leopard-038
> talos-r3-leopard-040
> talos-r3-leopard-042
> talos-r3-leopard-045

These slaves weren't connecting to buildbot. The logs suggested clearing the puppet certificate on the master. I cleared the cert with puppetca for all of these leopard slaves and am rebooting them and verifying the connection.

So far I have verified that these are connected to their master:
> talos-r3-leopard-008
> talos-r3-leopard-009

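Clearing certificates for a batch of slaves, as described above, amounts to one `puppetca --clean` call per host on the puppet master. The following is a minimal sketch that only builds the command lines and does not run them. It assumes each certificate is named after the slave's fully qualified host name; the function name and the `domain` parameter are hypothetical.

```python
def clean_cert_commands(hosts, domain):
    """Build one `puppetca --clean` argv list per slave; nothing is executed.

    Assumption: the certificate name is "<host>.<domain>".
    """
    return [["puppetca", "--clean", f"{host}.{domain}"] for host in hosts]
```

Each returned list could be reviewed first and then passed to `subprocess.run` on the master. Keeping command generation separate from execution makes it easy to check the host list before any certificate is removed.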
Verified to be online and connected to buildbot:
> talos-r3-leopard-012
> talos-r3-leopard-014
> talos-r3-leopard-013 - had wrong port number
> talos-r3-leopard-015
> talos-r3-leopard-017
> talos-r3-leopard-018
> talos-r3-leopard-019
> talos-r3-leopard-031
> talos-r3-leopard-032
> talos-r3-leopard-038
> talos-r3-leopard-040
> talos-r3-leopard-042
> talos-r3-leopard-045

Down:
> talos-r3-leopard-016

Worked on and need to verify the following:
> talos-r3-leopard-003
> talos-r3-leopard-005
> talos-r3-leopard-006
> talos-r3-leopard-007
> talos-r3-leopard-034

Verified to be online and connected to buildbot:
> talos-r3-leopard-003
> talos-r3-leopard-005
> talos-r3-leopard-006

I haven't a clue what is wrong with:
> talos-r3-leopard-007
> talos-r3-leopard-034

Verified on buildbot:
> talos-r3-snow-022
> talos-r3-snow-042
> talos-r3-snow-049
> talos-r3-snow-033
> talos-r3-snow-034
> talos-r3-snow-040

Need to debug:
> talos-r3-snow-011
> talos-r3-snow-017
> talos-r3-snow-019
> talos-r3-snow-021
> talos-r3-snow-025
> talos-r3-snow-026
> talos-r3-snow-031
> talos-r3-snow-032
> talos-r3-snow-038
> talos-r3-snow-053
> talos-r3-snow-055
> talos-r3-snow-015 (mv)
> talos-r3-snow-016 (mv)
> talos-r3-snow-018 (mv)

Verified:
> talos-r3-snow-031
> talos-r3-leopard-007
> talos-r3-snow-019 puppet was pointing to build.m.o not build.scl1.m.c
puppet was pointing to build.m.o not build.scl1.m.c
> talos-r3-snow-015 (mv)
> talos-r3-snow-016 (mv)
> talos-r3-snow-018 (mv)

Down:
> talos-r3-snow-011 offline since Oct 29, 2010
> talos-r3-snow-017 offline since Oct 16, 2010
> talos-r3-snow-021 offline since Oct 27, 2010
> talos-r3-snow-025 offline since Nov 02, 2010
> talos-r3-snow-026 offline since Oct 19, 2010
> talos-r3-snow-032 offline since Oct 21, 2010
> talos-r3-snow-038 offline since Oct 26, 2010
> talos-r3-snow-053 offline since Nov 04, 2010
> talos-r3-snow-055 offline since Nov 04, 2010

Comment 10 should read as below (there was an accidental paste in comment 10):

Verified:
> talos-r3-snow-031
> talos-r3-leopard-007
> talos-r3-snow-019 puppet was pointing to build.m.o not build.scl1.m.c
> talos-r3-snow-015 (mv)
> talos-r3-snow-016 (mv)
> talos-r3-snow-018 (mv)

Down:
> talos-r3-snow-011 offline since Oct 29, 2010
> talos-r3-snow-017 offline since Oct 16, 2010
> talos-r3-snow-021 offline since Oct 27, 2010
> talos-r3-snow-025 offline since Nov 02, 2010
> talos-r3-snow-026 offline since Oct 19, 2010
> talos-r3-snow-032 offline since Oct 21, 2010
> talos-r3-snow-038 offline since Oct 26, 2010
> talos-r3-snow-053 offline since Nov 04, 2010
> talos-r3-snow-055 offline since Nov 04, 2010

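The "offline since" dates in the Down list can be turned into outage lengths to decide which machines to chase first. This is a minimal sketch, assuming each entry has the form "<host> offline since <Mon DD, YYYY>" as in the list above (without the leading "> "). The function names and the `as_of` reference date are hypothetical, and month-name parsing assumes an English locale.

```python
from datetime import date, datetime


def parse_down_entry(entry):
    """Split 'host offline since Mon DD, YYYY' into (host, date)."""
    host, _, stamp = entry.partition(" offline since ")
    return host.strip(), datetime.strptime(stamp.strip(), "%b %d, %Y").date()


def days_offline(entries, as_of: date):
    """Return (host, days offline as of `as_of`) pairs, longest outage first."""
    parsed = [parse_down_entry(entry) for entry in entries]
    outages = [(host, (as_of - since).days) for host, since in parsed]
    return sorted(outages, key=lambda pair: pair[1], reverse=True)
```

Sorting by outage length puts the slaves that have been missing longest at the top, which is a reasonable order for filing or updating reboot requests.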
Verified:
> talos-r3-fed-029

Down:
> talos-r3-fed-012 not responding to pings
> talos-r3-fed-018 not responding to pings
> talos-r3-fed-019 not responding to pings
> talos-r3-fed-022 not responding to pings
> talos-r3-fed-024 not responding to pings
> talos-r3-fed-035 not responding to pings
> talos-r3-fed-036 os is broken
> talos-r3-fed-039 not syncing with puppet
> talos-r3-fed-040 not responding to pings

I'm not sure what the current state is here. Would it be worth using a google spreadsheet for this sort of "roundup", to facilitate tracking the changing state and coordinating multiple engineers? The final state could be attached to the ticket for historical posterity, if necessary.
(In reply to comment #8)
> Verified to be online and connected to buildbot:
>
> talos-r3-leopard-003

This box is currently not running buildbot - (buildbot.tac.off). Is there a reason?

(In reply to comment #14)
> I'm not sure what the current state is here. Would it be worth using a google
> spreadsheet for this sort of "roundup", to facilitate tracking the changing
> state and coordinating multiple engineers? The final state could be attached
> to the ticket for historical posterity, if necessary.

Good call. Bear, Rail, and I had been using an etherpad this week, but a spreadsheet probably makes more sense. I'll get that set up today and move our etherpad data into it.

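The per-slave state scattered across the comments could be kept in a simple tabular form that imports cleanly into a spreadsheet. This is a minimal sketch, assuming three hypothetical columns (host, status, note); the column names and function names are illustrative and do not describe the sheet that was actually created.

```python
import csv

# Hypothetical column layout for the roundup sheet.
FIELDS = ["host", "status", "note"]


def write_roundup(rows, stream):
    """Write (host, status, note) tuples to `stream` as CSV with a header row."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(FIELDS)
    writer.writerows(rows)


def count_by_status(rows):
    """Count how many slaves are in each status."""
    counts = {}
    for _host, status, _note in rows:
        counts[status] = counts.get(status, 0) + 1
    return counts
```

Passing an open file (or `sys.stdout`) as `stream` produces a file that a spreadsheet can import, and the final version could be attached to the bug as suggested above.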
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering