Closed
Bug 906660
Opened 11 years ago
Closed 11 years ago
Windows builders seem to be falling behind
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Assigned: armenzg)
References
Details
09:25 RyanVM: I'm only seeing 28 running Windows builds+l10n at the moment (not including Try)
09:25 RyanVM: actually, not even that many
09:25 RyanVM: but Windows builds are falling behind
09:25 RyanVM: do slaves need a kick?
Reference: https://secure.pub.build.mozilla.org/buildapi/running
Search for "WINNT 5.2"
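The manual check described above (searching the running-builds page for "WINNT 5.2") amounts to counting the builder names that contain that string. A minimal sketch, assuming the names have already been collected into a list; the function name and the sample names are hypothetical, and the real page format is not shown in this bug:

```python
def count_matching(builder_names, needle="WINNT 5.2"):
    """Count the builder names that contain the search string."""
    return sum(1 for name in builder_names if needle in name)


# Hypothetical sample names, for illustration only (not real data).
sample = [
    "WINNT 5.2 example-branch build",
    "Linux example-branch build",
    "WINNT 5.2 example-branch l10n",
]

print(count_matching(sample))
```

Comparing this count over time would show whether the number of running Windows builds is dropping.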
Assignee
Comment 1•11 years ago
I think I see 40 of them that went offline around 6 and 4 days ago. I saw this in twistd.log ("slaves" branch): http://hg.mozilla.org/build/buildbot/file/9dc77b3a5f14/slave/buildslave/idleizer.py#l131
I don't see anything else of interest in twistd.log or in the event manager.
Assignee
Comment 2•11 years ago
It seems that bhearsum will be fixing the IP banning. Let's hope this prevents further issues.
09:54 armenzg_buildduty: jhopkins: good morning; there are around 40 Windows builders that are up but they are not taking anymore jobs
09:54 armenzg_buildduty: twistd.log said that they were idle and tried to reboot themselves but nothing happened
09:54 armenzg_buildduty: I want to figure why briar patch did not successfully manage to reboot them
09:55 armenzg_buildduty: for instance https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=build&type=w64-ix&name=w64-ix-slave92
09:55 jhopkins: armenzg_buildduty: ok. can you give me a couple of sample hostnames?
09:55 armenzg_buildduty: I see a message from Aug 12th Mon Aug 12 18:39:24 2013
09:55 armenzg_buildduty: could I give that host and another one for you to look into it?
09:55 armenzg_buildduty: I will reboot the remaining
09:55 armenzg_buildduty: it doesn't have to be now
09:55 armenzg_buildduty: https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=build&type=w64-ix&name=w64-ix-slave64
10:14 jhopkins: armenzg_buildduty: for w64-ix-slave92, cruncher was in the ip-ban list so an SSH connection was not possible. An IPMI reboot claimed to be successful but did nothing. It appears that the IPMI user/password it used may be incorrect. Still looking...
10:15 armenzg_buildduty: jhopkins: if I get instructions I can later fix the IP banning on all win64 machines
10:15 bhearsum: armenzg_buildduty: i'm going to fix that soon
13:33 catlee: armenzg_buildduty: were there any stuck windows slaves this morning?
13:33 armenzg_brb: catlee: they tried to reboot because they were idle but never did
13:34 armenzg_brb: briar patch tried to reboot them through ssh but it was on the IP ban list
13:34 catlee: ah ha
13:34 armenzg_brb: jhopkins has been looking more into it
13:34 catlee: sounds like https://bugzilla.mozilla.org/show_bug.cgi?id=893859
13:34 armenzg_brb: briar patch also did ipmi reboots
13:35 catlee: I could deploy the new buildbot to some of the build masters to see if it helps
13:35 catlee: did we remove any builders last week?
13:36 catlee: yeah, we did...
13:36 armenzg_brb: catlee: maybe the b2d pandas
13:36 armenzg_brb: if you could pick a master and deploy it that would be great
13:36 armenzg_brb: I could keep an eye
Assignee
Comment 3•11 years ago
Today I rebooted a bunch more that I had left out yesterday.
w64-ix-slave{79,82,128,120,19,95,122,96,104,107,99,42,129,74,87,70,110,24,78,116}
Let's see which ones come back.
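The brace shorthand above stands for 20 individual hostnames. A minimal sketch of expanding it, for example to check each host later; `expand_braces` is a hypothetical helper written for illustration, it handles a single brace group only, and it is not part of any tool mentioned in this bug:

```python
import re


def expand_braces(pattern):
    """Expand one {a,b,c} group in a pattern into a list of strings."""
    match = re.fullmatch(r"(.*?)\{([^{}]*)\}(.*)", pattern)
    if match is None:
        # No brace group: the pattern is already a single name.
        return [pattern]
    prefix, body, suffix = match.groups()
    return [f"{prefix}{item.strip()}{suffix}" for item in body.split(",")]


hosts = expand_braces(
    "w64-ix-slave{79,82,128,120,19,95,122,96,104,107,"
    "99,42,129,74,87,70,110,24,78,116}"
)

for host in hosts:
    print(host)
```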
Assignee
Comment 4•11 years ago
We should be good now.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Assignee
Comment 5•11 years ago
This is ongoing.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 6•11 years ago
For example: w64-ix-slave56
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=build&type=w64-ix&name=w64-ix-slave56
Last job ends at Thu Aug 22 17:54:39 2013
Kitten herder tries to reboot a few hours later: Fri Aug 23 00:42:09 2013
If I check twistd.log I see that it rebooted and re-connected.
If I check with VNC I see the shutdown event tracker (bug 893888).
I will try to check all slaves and see what is going on. It should have been fixed.
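The gap between the last job and the reboot attempt can be computed from the two timestamps quoted above. A minimal sketch, assuming both timestamps use the same format and time zone and that the default C/English locale is in effect; `idle_gap` is a hypothetical helper for illustration:

```python
from datetime import datetime

# Format of the timestamps quoted in this comment, e.g. "Thu Aug 22 17:54:39 2013".
LOG_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def idle_gap(last_job_end, reboot_attempt):
    """Return the time between the last job ending and the reboot attempt."""
    start = datetime.strptime(last_job_end, LOG_TIME_FORMAT)
    end = datetime.strptime(reboot_attempt, LOG_TIME_FORMAT)
    return end - start


gap = idle_gap("Thu Aug 22 17:54:39 2013", "Fri Aug 23 00:42:09 2013")
print(gap)  # 6:47:30
```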
Assignee
Updated•11 years ago
Assignee
Comment 7•11 years ago
The shutdown event tracker fix has been deployed to most machines. We should now be good.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard