Closed Bug 1019523 Opened 10 years ago Closed 10 years ago

Large set of t-snow-r4 slaves is disabled (broken in slave-health)

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sbruno, Assigned: sbruno)

Many t-snow-r4 slaves are in "broken" status in slave_health and have not been taking jobs for a long time:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-snow-r4

This does not seem related to the recent Train B move, which involved the following slaves (new name after move reported here) (source: Callek in #releng, https://callek.pastebin.mozilla.org/5327644):

t-snow-r4-0041.test.releng.scl3.mozilla.com
t-snow-r4-0042.test.releng.scl3.mozilla.com
t-snow-r4-0043.test.releng.scl3.mozilla.com
t-snow-r4-0044.test.releng.scl3.mozilla.com
t-snow-r4-0045.test.releng.scl3.mozilla.com
t-snow-r4-0046.test.releng.scl3.mozilla.com
t-snow-r4-0047.test.releng.scl3.mozilla.com
t-snow-r4-0048.test.releng.scl3.mozilla.com
t-snow-r4-0049.test.releng.scl3.mozilla.com
t-snow-r4-0050.test.releng.scl3.mozilla.com
t-snow-r4-0051.test.releng.scl3.mozilla.com
t-snow-r4-0052.test.releng.scl3.mozilla.com
t-snow-r4-0053.test.releng.scl3.mozilla.com
t-snow-r4-0054.test.releng.scl3.mozilla.com
t-snow-r4-0055.test.releng.scl3.mozilla.com
t-snow-r4-0056.test.releng.scl3.mozilla.com
t-snow-r4-0057.test.releng.scl3.mozilla.com
t-snow-r4-0058.test.releng.scl3.mozilla.com
t-snow-r4-0059.test.releng.scl3.mozilla.com
t-snow-r4-0060.test.releng.scl3.mozilla.com
t-snow-r4-0061.test.releng.scl3.mozilla.com
t-snow-r4-0062.test.releng.scl3.mozilla.com
t-snow-r4-0063.test.releng.scl3.mozilla.com
t-snow-r4-0064.test.releng.scl3.mozilla.com
t-snow-r4-0065.test.releng.scl3.mozilla.com
t-snow-r4-0066.test.releng.scl3.mozilla.com
t-snow-r4-0067.test.releng.scl3.mozilla.com
t-snow-r4-0068.test.releng.scl3.mozilla.com
t-snow-r4-0069.test.releng.scl3.mozilla.com
t-snow-r4-0070.test.releng.scl3.mozilla.com
t-snow-r4-0071.test.releng.scl3.mozilla.com
t-snow-r4-0072.test.releng.scl3.mozilla.com
t-snow-r4-0073.test.releng.scl3.mozilla.com
t-snow-r4-0074.test.releng.scl3.mozilla.com
t-snow-r4-0075.test.releng.scl3.mozilla.com
t-snow-r4-0076.test.releng.scl3.mozilla.com
t-snow-r4-0077.test.releng.scl3.mozilla.com
t-snow-r4-0078.test.releng.scl3.mozilla.com
t-snow-r4-0079.test.releng.scl3.mozilla.com
t-snow-r4-0080.test.releng.scl3.mozilla.com
t-snow-r4-0081.test.releng.scl3.mozilla.com
t-snow-r4-0082.test.releng.scl3.mozilla.com
t-snow-r4-0083.test.releng.scl3.mozilla.com
t-snow-r4-0084.test.releng.scl3.mozilla.com

E.g., t-snow-r4-0002 is not listed here but it has not been taking jobs for more than 20 hours (at the time I am raising this).

Theories to be verified (source: Callek in #releng):
(*) disconnected mid job due to some network blip ~ 13 hours ago [now], so never rebooted
(*) slaverebooter somehow not trying to reboot these, despite my memory of how it works,
(*) slaveapi itself being wedged,
(*) slaveapi not having flows to the new machines or their pdu's
Depends on: 986599
This seems related to some puppet installation issues after the 3.6.1 upgrade; see https://bugzilla.mozilla.org/show_bug.cgi?id=986599#c34
Now that 986599 is fixed, I rebooted t-snow-r4-0002, which is now taking jobs again:
http://buildbot-master108.srv.releng.scl3.mozilla.com:8201/buildslaves/t-snow-r4-0002

I will now reboot the slaves listed below - hopefully they will start working again as well.

t-snow-r4-0011
t-snow-r4-0040
t-snow-r4-0033
t-snow-r4-0025
t-snow-r4-0006
t-snow-r4-0013
t-snow-r4-0030
t-snow-r4-0031
t-snow-r4-0004
t-snow-r4-0018
t-snow-r4-0037
t-snow-r4-0015
t-snow-r4-0027
t-snow-r4-0020
t-snow-r4-0034
t-snow-r4-0012
t-snow-r4-0016
t-snow-r4-0036
t-snow-r4-0007
t-snow-r4-0035
t-snow-r4-0010
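
As a quick sanity check of the claim that this breakage is unrelated to the Train B move, the rough sketch below compares the rebooted slaves against the Train B host range. It assumes Train B covered exactly t-snow-r4-0041 through t-snow-r4-0084, as listed earlier in this bug; the names `TRAIN_B`, `short_name` and `outside_train_b` are hypothetical helpers for illustration and are not part of slaveapi, slaverebooter or any other releng tool.

```python
# Assumption: Train B moved t-snow-r4-0041 .. t-snow-r4-0084 (per the list in this bug).
TRAIN_B = {f"t-snow-r4-{n:04d}" for n in range(41, 85)}


def short_name(host: str) -> str:
    """Drop any domain suffix such as '.test.releng.scl3.mozilla.com'."""
    return host.split(".", 1)[0]


def outside_train_b(hosts):
    """Return the sorted short names of hosts that were not part of Train B."""
    return sorted(h for h in map(short_name, hosts) if h not in TRAIN_B)


# Slaves rebooted in this bug: t-snow-r4-0002 plus the list above.
rebooted = """
t-snow-r4-0002 t-snow-r4-0011 t-snow-r4-0040 t-snow-r4-0033 t-snow-r4-0025
t-snow-r4-0006 t-snow-r4-0013 t-snow-r4-0030 t-snow-r4-0031 t-snow-r4-0004
t-snow-r4-0018 t-snow-r4-0037 t-snow-r4-0015 t-snow-r4-0027 t-snow-r4-0020
t-snow-r4-0034 t-snow-r4-0012 t-snow-r4-0016 t-snow-r4-0036 t-snow-r4-0007
t-snow-r4-0035 t-snow-r4-0010
""".split()

unrelated = outside_train_b(rebooted)
print(f"{len(unrelated)} of {len(rebooted)} rebooted slaves were not part of Train B")
```

If every rebooted slave falls outside the Train B range, that is consistent with the move not being the cause, though it does not by itself identify the cause.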
Assignee: nobody → sbruno
All boxes seem to be back to work. (Thanks nthomas and )
The previous comment should have ended with: "Thanks nthomas and dustin for your help here!"
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Summary: Large set of t-snow-r4 are disabled (broken in slave-health) → Large set of t-snow-r4 slaves is disabled (broken in slave-health)
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard