Closed Bug 1248347 Opened 8 years ago Closed 8 years ago

Many Windows testers became non-responsive

Categories

(Infrastructure & Operations :: DCOps, task, P3)

x86
Windows 7

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aselagea, Assigned: van)

Details

We noticed many Windows testers in an idle state today. The Nagios dashboard revealed that many of them were not responding to ping.
The Windows slaves have been rebooted via IPMI and everything has returned to normal.
This morning we noticed again that some Windows slaves were down.
The following slaves have been rebooted from IPMI:
- t-w732-ix-242
- t-w732-ix-209
- t-w732-ix-201
- t-w732-ix-199
- t-w732-ix-196
- t-w732-ix-186
- t-w732-ix-184
- t-w732-ix-159
- t-w732-ix-152
- t-w732-ix-140
- t-w732-ix-139
- t-w732-ix-121
- t-w732-ix-110
- t-w732-ix-103
- t-w732-ix-099
- t-w732-ix-079
- t-w732-ix-076
- t-w732-ix-064
- t-w732-ix-038
- t-w732-ix-011
- t-w732-ix-005
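For reference, this is roughly how a batch like the one above could be power-cycled over IPMI. This is only a minimal sketch: the "<host>-mgmt" BMC naming convention and the IPMI_USER/IPMI_PASS environment variables are assumptions, not something confirmed in this bug.

    # Minimal sketch: power-cycle a batch of unresponsive slaves via ipmitool.
    # Assumptions (not from this bug): BMCs answer at "<host>-mgmt" and
    # credentials are provided via the IPMI_USER/IPMI_PASS environment variables.
    import os
    import subprocess

    HOSTS = ["t-w732-ix-242", "t-w732-ix-209"]  # illustrative subset

    def power_cycle(host):
        bmc = host + "-mgmt"  # hypothetical BMC naming convention
        cmd = [
            "ipmitool", "-I", "lanplus",
            "-H", bmc,
            "-U", os.environ["IPMI_USER"],
            "-P", os.environ["IPMI_PASS"],
            "chassis", "power", "cycle",
        ]
        subprocess.run(cmd, check=True)

    if __name__ == "__main__":
        for h in HOSTS:
            power_cycle(h)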
I have disabled t-w732-ix-242. Once the current tests are completed, I will check into it and see if I can find any reason for the unpingable state.
Please let me know which hosts you need me to look at.
Assignee: server-ops-dcops → vle
colo-trip: --- → scl3
QA Contact: cshields
From looking at t-w732-ix-242 it looks like the machine lost power shortly after 1600 on 2016-02-15. Nothing interesting in the logs before and no logs for about 10 hours after. This shows up shortly after the logs start up again: 

The previous system shutdown at 4:03:26 PM on 2/15/2016 was unexpected.
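For anyone checking other hosts for the same signature: that message is Event ID 6008 in the Windows System log, so it could be pulled remotely or in a loop. A minimal sketch, assuming it is run on the slave itself:

    # Minimal sketch: list recent unexpected-shutdown events (Event ID 6008)
    # from the Windows System log. Assumes it runs locally on the slave.
    import subprocess

    def recent_unexpected_shutdowns(count=5):
        cmd = [
            "wevtutil", "qe", "System",
            "/q:*[System[(EventID=6008)]]",  # 6008 = "previous system shutdown ... was unexpected"
            "/c:%d" % count,
            "/rd:true",   # newest first
            "/f:text",
        ]
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

    if __name__ == "__main__":
        print(recent_unexpected_shutdowns())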

van: Are all these machines connected through the same power unit?
Flags: needinfo?(vle)
:marko, the nodes in #c3 are in at least 4 different racks. The chassis that house the hosts are also on redundant PSUs. I checked the switches in the racks and they're all showing uptimes of 244 days.

>From looking at t-w732-ix-242 it looks like the machine lost power shortly after 1600 on 2016-02-15. 

Were the XP and Linux hosts affected? They're also in this rack, so power would have been lost to them as well.

https://inventory.mozilla.org/en-US/systems/racks/?rack=227
Flags: needinfo?(vle)
It looks very much like these are going down due to some timer or something. There are more down this evening and they're dying off in small batches (see the last column, the duration down). 

t-w732-ix-217.wintest.releng.scl3.mozilla.com
  DOWN  02-16-2016 18:32:11   0d 0h 33m 21s
t-w732-ix-079.wintest.releng.scl3.mozilla.com
  DOWN  02-16-2016 18:28:21   0d 0h 52m 44s    
t-w732-ix-238.wintest.releng.scl3.mozilla.com
  DOWN  02-16-2016 18:28:21   0d 0h 52m 44s    
t-w732-ix-191.wintest.releng.scl3.mozilla.com
  DOWN  02-16-2016 18:30:11   0d 1h 17m 40s    
t-w732-ix-070.wintest.releng.scl3.mozilla.com
  DOWN  02-16-2016 18:29:31   0d 1h 18m 0s   
t-w732-ix-181.wintest.releng.scl3.mozilla.com 
  DOWN  02-16-2016 18:29:31   0d 1h 18m 0s   
t-w732-ix-133.wintest.releng.scl3.mozilla.com 
  DOWN  02-16-2016 18:30:01   0d 2h 16m 14s    
t-w732-ix-196.wintest.releng.scl3.mozilla.com 
  DOWN  02-16-2016 18:29:01   0d 2h 17m 14s    
t-w732-ix-117.wintest.releng.scl3.mozilla.com 
  DOWN  02-16-2016 18:28:51   0d 2h 17m 24s    
t-w732-ix-119.wintest.releng.scl3.mozilla.com 
  DOWN  02-16-2016 18:28:51   0d 2h 17m 24s    
t-w732-ix-166.wintest.releng.scl3.mozilla.com 
  DOWN  02-16-2016 18:28:51   0d 2h 17m 24s    
t-w732-ix-169.wintest.releng.scl3.mozilla.com 
  DOWN  02-16-2016 18:28:51   0d 2h 17m 24s    
t-w732-ix-110.wintest.releng.scl3.mozilla.com
  DOWN  02-16-2016 18:29:31   0d 4h 30m 31s
Two more:

t-w732-ix-008.wintest.releng.scl3.mozilla.com 
  DOWN 	02-16-2016 19:02:11 	0d 0h 21m 50s
t-w732-ix-078.wintest.releng.scl3.mozilla.com 
  DOWN 	02-16-2016 19:02:11 	0d 0h 21m 50s

I've rebooted them all via IPMI. 110 was down for a bad disk (being handled in another bug), and 119 didn't come back up.
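To make the "dying off in small batches" pattern easier to see, the pasted Nagios entries above can be bucketed by how long each host has been down. A minimal sketch, assuming the two-line copy/paste format used in the comments above:

    # Minimal sketch: bucket Nagios "DOWN" entries by downtime to spot batches.
    # Assumes the copy-pasted two-line format used in the comments above.
    import re
    from collections import Counter

    NAGIOS_TEXT = """\
    t-w732-ix-217.wintest.releng.scl3.mozilla.com
      DOWN  02-16-2016 18:32:11   0d 0h 33m 21s
    t-w732-ix-079.wintest.releng.scl3.mozilla.com
      DOWN  02-16-2016 18:28:21   0d 0h 52m 44s
    """

    def batches(text):
        # Round each downtime to the nearest 10 minutes and count hosts per bucket.
        counts = Counter()
        for days, hours, mins in re.findall(r"(\d+)d (\d+)h (\d+)m", text):
            total_min = int(days) * 1440 + int(hours) * 60 + int(mins)
            counts[round(total_min, -1)] += 1
        return counts

    if __name__ == "__main__":
        for bucket, n in sorted(batches(NAGIOS_TEXT).items()):
            print("~%d min down: %d host(s)" % (bucket, n))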
There were another 54 down this morning:

t-w732-ix-253
t-w732-ix-079
t-w732-ix-006
t-w732-ix-144
t-w732-ix-061
t-w732-ix-004
t-w732-ix-050
t-w732-ix-065
t-w732-ix-210
t-w732-ix-008
t-w732-ix-062
t-w732-ix-082
t-w732-ix-125
t-w732-ix-130
t-w732-ix-135
t-w732-ix-217
t-w732-ix-229
t-w732-ix-250
t-w732-ix-034
t-w732-ix-048
t-w732-ix-089
t-w732-ix-129
t-w732-ix-139
t-w732-ix-095
t-w732-ix-169
t-w732-ix-212
t-w732-ix-231
t-w732-ix-012
t-w732-ix-186
t-w732-ix-221
t-w732-ix-153
t-w732-ix-114
t-w732-ix-183
t-w732-ix-017
t-w732-ix-073
t-w732-ix-059
t-w732-ix-067
t-w732-ix-103
t-w732-ix-225
t-w732-ix-075
t-w732-ix-002
t-w732-ix-015
t-w732-ix-188
t-w732-ix-007
t-w732-ix-243
t-w732-ix-251
t-w732-ix-124
t-w732-ix-237
t-w732-ix-001
t-w732-ix-112
t-w732-ix-053
t-w732-ix-195
t-w732-ix-155
t-w732-ix-108
Which of these symptoms apply?
* Unresponsive to pings
* Not taking any more Buildbot jobs (i.e. up and running but not taking jobs)
* Cannot ssh
* Cannot RDP/VNC
* Completely off

It would be interesting to look at the logs of the last job the machine took and see if it is something caused by a job.
I've tried looking but it is hard to know without knowing when the machine was rebooted.

Is anything still running on the host (any lingering processes or prompts)?
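To answer the symptom checklist above across many slaves at once, something like the following could classify each host as pingable / SSH reachable / RDP reachable. A minimal sketch only: the hostname is illustrative, the ping flags are the Linux ones, and it assumes the slaves normally answer on TCP 22 and 3389.

    # Minimal sketch: classify a slave against the symptom checklist above
    # (pingable? SSH port open? RDP port open?). Hostname is illustrative.
    import socket
    import subprocess

    def port_open(host, port, timeout=3):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def probe(host):
        ping = subprocess.run(
            ["ping", "-c", "1", "-W", "3", host],  # Linux ping flags; use -n/-w on Windows
            stdout=subprocess.DEVNULL,
        ).returncode == 0
        return {
            "ping": ping,
            "ssh (22)": port_open(host, 22),
            "rdp (3389)": port_open(host, 3389),
        }

    if __name__ == "__main__":
        print(probe("t-w732-ix-092.wintest.releng.scl3.mozilla.com"))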
Is it possible we have a virus? I suspect there is a pattern we could find to help us determine whether this is a datacenter issue, an OS config issue, or something else like a virus or exceptions in our builds/tests.
Could it be that we're consuming too much power at certain points when there are a lot of jobs running at once?
Has there been any power-related work?

Do the same machines get into the same state after a reboot?
Does a re-image take away the issue?
Could we be running out of disk?
(In reply to Armen Zambrano [:armenzg] - Engineering productivity from comment #11)

The machines are unresponsive to ping (which means no jobs, no SSH, no RDP/VNC, no connections at all). Since the onboard graphics cards are turned off (otherwise jobs fail), there's no way to tell whether a machine is actually shut down or still running without a network connection.


(In reply to Joel Maher [:jmaher] - Engineering productivity from comment #12)

I think a virus is unlikely, but there's a slim possibility, of course. We are looking for the pattern which will help us determine the cause, yes. So far nothing obvious.


from irc:

philor: don't know if this is somehow coincidence, but if you look at what the win7 slaves were doing when they died, I think you'll find a massive preponderance were running e10s browser-chrome-7 on a trunk tree
philor: https://treeherder.mozilla.org/#/jobs?repo=fx-team&fromchange=b2d75ac5ba0f&group_state=expanded&filter-resultStatus=retry&filter-searchStr=windows%207

There were some e10s patches made to buildbot on the 12th, but they were broken, backed out, and reapplied. They were also for jsreftest, so that seems unlikely to be our culprit. I didn't see anything else in hg (under build) that looked like it might be applicable: https://hg.mozilla.org/build?sort=lastchange

I was also wondering if a buildbot restart might have picked up an issue that was checked in earlier.
Do we know if this started on the 12th? or earlier?

I see an increase since the 9th in bug 1247453 (Intermittent zombiecheck | child process XXXX still alive after shutdown), but it is not high enough to account for this.
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1247453&startday=2016-02-08&endday=2016-02-17&tree=all
Van, can you hook a monitor up to the following hosts to see what state they're in? We're wondering if they're still up but frozen (maybe we've run out of memory).

t-w732-ix-092.wintest.releng.scl3.mozilla.com 
t-w732-ix-236.wintest.releng.scl3.mozilla.com 
t-w732-ix-036.wintest.releng.scl3.mozilla.com 
t-w732-ix-040.wintest.releng.scl3.mozilla.com 

And can you please enable the onboard graphics for t-w732-ix-092.wintest.releng.scl3.mozilla.com and t-w732-ix-040.wintest.releng.scl3.mozilla.com when you're done?
Flags: needinfo?(vle)
t-w732-ix-092 - no video output, no activity when i plugged in a mouse/keyboard

t-w732-ix-236 - no video output, no activity when i plugged in a mouse/keyboard

t-w732-ix-036 - sitting at desktop

t-w732-ix-040 - no video output, no activity when i plugged in a mouse/keyboard
Flags: needinfo?(vle)
>t-w732-ix-092.wintest.releng.scl3.mozilla.com and t-w732-ix-040.wintest.releng.scl3.mozilla.com

Changed to onboard video.
"Fixed" by the backout in https://hg.mozilla.org/mozilla-central/rev/0629918a09ae - the why of bug 1232042 causing what I'll bet was a driver bluescreen remains to be determined, and there'll no doubt be calls to update the video driver, but as far as I'm concerned this bug is fixed.
Turned on SOL remote access for 040 and 092 so we can disable and enable those cards without a colo trip if we have more issues. I have disabled the onboard cards, verified the resolution, and the machines are about to be re-enabled in slavealloc.
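For the record, with SOL enabled the consoles for 040 and 092 can be reached remotely with ipmitool. A minimal sketch, under the same assumptions (hypothetical "<host>-mgmt" BMC naming, credentials in the environment) as the power-cycle sketch earlier:

    # Minimal sketch: attach to a slave's serial-over-LAN console via ipmitool.
    # Same assumptions as the power-cycle sketch above: "<host>-mgmt" BMC naming
    # and IPMI_USER/IPMI_PASS in the environment.
    import os
    import subprocess

    def sol_console(host):
        bmc = host + "-mgmt"  # hypothetical BMC naming convention
        subprocess.run([
            "ipmitool", "-I", "lanplus",
            "-H", bmc,
            "-U", os.environ["IPMI_USER"],
            "-P", os.environ["IPMI_PASS"],
            "sol", "activate",
        ], check=True)

    if __name__ == "__main__":
        sol_console("t-w732-ix-040")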
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED