Closed Bug 906660 Opened 11 years ago Closed 11 years ago

Windows builders seem to be falling behind

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
Windows Server 2008
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: armenzg)

09:25 RyanVM: I'm only seeing 28 running Windows builds+l10n at the moment (not including Try)
09:25 RyanVM: actually, not even that many
09:25 RyanVM: but Windows builds are falling behind
09:25 RyanVM: do slaves need a kick?

Reference: https://secure.pub.build.mozilla.org/buildapi/running
Search for "WINNT 5.2"
I think I see about 40 of them that went offline roughly 4 to 6 days ago.

I saw this on twistd.log ("slaves" branch):
http://hg.mozilla.org/build/buildbot/file/9dc77b3a5f14/slave/buildslave/idleizer.py#l131

I don't see anything else of interest in twistd.log or in the event manager.
It seems that bhearsum will be fixing the IP banning. Let's hope this prevents further issues.

09:54 armenzg_buildduty: jhopkins: good morning; there are around 40 Windows builders that are up but they are not taking anymore jobs
09:54 armenzg_buildduty: twistd.log said that they were idle and tried to reboot themselves but nothing happened
09:54 armenzg_buildduty: I want to figure why briar patch did not successfully manage to reboot them
09:55 armenzg_buildduty: for instance https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=build&type=w64-ix&name=w64-ix-slave92
09:55 jhopkins: armenzg_buildduty: ok.  can you give me a couple of sample hostnames?
09:55 armenzg_buildduty: I see a message from Aug 12th Mon Aug 12 18:39:24 2013
09:55 armenzg_buildduty: could I give that host and another one for you to look into it? 
09:55 armenzg_buildduty: I will reboot the remaining
09:55 armenzg_buildduty: it doesn't have to be now 
09:55 armenzg_buildduty: https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=build&type=w64-ix&name=w64-ix-slave64

10:14 jhopkins: armenzg_buildduty: for w64-ix-slave92, cruncher was in the ip-ban list so an SSH connection was not possible.  An IPMI reboot claimed to be successful but did nothing.  It appears that the IPMI user/password it used may be incorrect.  Still looking...
10:15 armenzg_buildduty: jhopkins: if I get instructions I can later fix the IP banning on all win64 machines
10:15 bhearsum: armenzg_buildduty: i'm going to fix that soon
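
The failure mode described above (a reboot command that reports success while the machine stays up) can be caught by checking that the host's boot time actually changed. This is only a minimal sketch under stated assumptions: `reboot_with_fallback`, the `methods` callables and `get_boot_time` are hypothetical names for illustration and are not taken from briar patch or kitten herder.

```python
def reboot_with_fallback(host, methods, get_boot_time):
    """Try each reboot method in order; trust it only if the boot time changes.

    methods: list of (name, callable) pairs. Each callable takes the host
             and returns True if the command claims success (hypothetical).
    get_boot_time: callable returning the host's last boot time (hypothetical).
    Returns the name of the method that really rebooted the host, or None.
    """
    before = get_boot_time(host)
    for name, attempt in methods:
        try:
            claimed_ok = attempt(host)
        except ConnectionError:
            # For example, SSH refused because the caller is on an IP ban list.
            continue
        # A real tool would wait for the host to come back before this check.
        if claimed_ok and get_boot_time(host) != before:
            return name
    return None
```

With a check like this, an IPMI call that claims success but leaves the boot time unchanged is treated as a failure and the next method is tried.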

13:33 catlee: armenzg_buildduty: were there any stuck windows slaves this morning?
13:33 armenzg_brb: catlee: they tried to reboot because they were idle but never did
13:34 armenzg_brb: briar patch tried to reboot them through ssh but it was on the IP ban list
13:34 catlee: ah ha
13:34 armenzg_brb: jhopkins has been looking more into it
13:34 catlee: sounds like https://bugzilla.mozilla.org/show_bug.cgi?id=893859
13:34 armenzg_brb: briar patch also did ipmi reboots
13:35 catlee: I could deploy the new buildbot to some of the build masters to see if it helps
13:35 catlee: did we remove any builders last week?
13:36 catlee: yeah, we did...
13:36 armenzg_brb: catlee: maybe the b2d pandas  
13:36 armenzg_brb: if you could pick a master and deploy it that would be great
13:36 armenzg_brb: I could keep an eye
Today I rebooted a bunch more that I had left out yesterday.
w64-ix-slave{79,82,128,120,19,95,122,96,104,107,99,42,129,74,87,70,110,24,78,116}
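
The brace notation above is shell-style shorthand for a list of hostnames. A minimal Python sketch that expands it, assuming the `w64-ix-slave` prefix and the numbers listed above (the helper name `expand_slaves` is made up for illustration):

```python
def expand_slaves(prefix, numbers):
    """Return full hostnames for a prefix and a list of slave numbers."""
    return [f"{prefix}{n}" for n in numbers]


# Slave numbers copied from the list of machines rebooted above.
REBOOTED = [79, 82, 128, 120, 19, 95, 122, 96, 104, 107,
            99, 42, 129, 74, 87, 70, 110, 24, 78, 116]

hosts = expand_slaves("w64-ix-slave", REBOOTED)
for host in hosts:
    print(host)
```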

Let's see which ones come back.
We should be good now.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
This is ongoing.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
For example:
w64-ix-slave56
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=build&type=w64-ix&name=w64-ix-slave56

Last job ends at Thu Aug 22 17:54:39 2013
Kitten herder tries to reboot it a few hours later: Fri Aug 23 00:42:09 2013
If I check twistd.log I see that it rebooted and re-connected.
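
To see how long the slave sat idle before the reboot attempt, the two timestamps above can be subtracted. A small sketch, assuming both timestamps are in the same time zone and that the process runs with an English (C) locale so that `%a` and `%b` parse:

```python
from datetime import datetime

# Format of the timestamps quoted above, e.g. "Thu Aug 22 17:54:39 2013".
FMT = "%a %b %d %H:%M:%S %Y"

last_job_end = datetime.strptime("Thu Aug 22 17:54:39 2013", FMT)
reboot_attempt = datetime.strptime("Fri Aug 23 00:42:09 2013", FMT)

idle_gap = reboot_attempt - last_job_end
print(idle_gap)  # prints 6:47:30
```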

If I check with VNC I see the shutdown event tracker (bug 893888).

I will try to check all slaves and see what is going on. It should have been fixed.
Depends on: 893888
Depends on: 895914
No longer depends on: 893888
The shutdown event tracker fix has been deployed to most machines.
We should now be good.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard