Closed Bug 1248347 Opened 8 years ago Closed 8 years ago

Many Windows testers became non-responsive

Categories

(Infrastructure & Operations :: DCOps, task, P3)

x86
Windows 7

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aselagea, Assigned: van)

Details

We noticed many Windows testers in an idle state today. The Nagios dashboard revealed that many of them were not responding to ping.
The Windows slaves have been rebooted via IPMI and everything has returned to normal.
This morning we noticed again that some Windows slaves were down.
The following slaves have been rebooted from IPMI:
- t-w732-ix-242
- t-w732-ix-209
- t-w732-ix-201
- t-w732-ix-199
- t-w732-ix-196
- t-w732-ix-186
- t-w732-ix-184
- t-w732-ix-159
- t-w732-ix-152
- t-w732-ix-140
- t-w732-ix-139
- t-w732-ix-121
- t-w732-ix-110
- t-w732-ix-103
- t-w732-ix-099
- t-w732-ix-079
- t-w732-ix-076
- t-w732-ix-064
- t-w732-ix-038
- t-w732-ix-011
- t-w732-ix-005
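For reference, this is roughly how a batch like the one above could be power-cycled over IPMI. This is only a minimal sketch: the "<host>-mgmt" BMC naming convention and the IPMI_USER/IPMI_PASS environment variables are assumptions, not something confirmed in this bug.

    # Minimal sketch: power-cycle a batch of unresponsive slaves via ipmitool.
    # Assumptions (not from this bug): BMCs answer at "<host>-mgmt" and
    # credentials are provided via the IPMI_USER/IPMI_PASS environment variables.
    import os
    import subprocess

    HOSTS = ["t-w732-ix-242", "t-w732-ix-209"]  # illustrative subset

    def power_cycle(host):
        bmc = host + "-mgmt"  # hypothetical BMC naming convention
        cmd = [
            "ipmitool", "-I", "lanplus",
            "-H", bmc,
            "-U", os.environ["IPMI_USER"],
            "-P", os.environ["IPMI_PASS"],
            "chassis", "power", "cycle",
        ]
        subprocess.run(cmd, check=True)

    if __name__ == "__main__":
        for h in HOSTS:
            power_cycle(h)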
I have disabled t-w732-ix-242. Once the current tests are completed, I will check into it and see if I can find any reason for the unpingable state.
Please let me know which hosts you need me to look at.
Assignee: server-ops-dcops → vle
colo-trip: --- → scl3
QA Contact: cshields
From looking at t-w732-ix-242 it looks like the machine lost power shortly after 1600 on 2016-02-15. Nothing interesting in the logs before and no logs for about 10 hours after. This shows up shortly after the logs start up again: 

The previous system shutdown at 4:03:26 PM on 2/15/2016 was unexpected.
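For anyone checking other hosts for the same signature: that message is Event ID 6008 in the Windows System log, so it could be pulled remotely or in a loop. A minimal sketch, assuming it is run on the slave itself:

    # Minimal sketch: list recent unexpected-shutdown events (Event ID 6008)
    # from the Windows System log. Assumes it runs locally on the slave.
    import subprocess

    def recent_unexpected_shutdowns(count=5):
        cmd = [
            "wevtutil", "qe", "System",
            "/q:*[System[(EventID=6008)]]",  # 6008 = "previous system shutdown ... was unexpected"
            "/c:%d" % count,
            "/rd:true",   # newest first
            "/f:text",
        ]
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

    if __name__ == "__main__":
        print(recent_unexpected_shutdowns())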

van: Are all these machines connected through the same power unit?
Flags: needinfo?(vle)
:marko, the nodes in #c3 are in at least 4 different racks. The chassis that house the hosts are also on redundant PSUs. I checked the switches in the racks and they're all showing uptimes of 244 days.

>From looking at t-w732-ix-242 it looks like the machine lost power shortly after 1600 on 2016-02-15. 

Were the XP and Linux hosts affected? They're also in this rack, so power would have been lost to them as well.

https://inventory.mozilla.org/en-US/systems/racks/?rack=227
Flags: needinfo?(vle)
It looks very much like these are going down due to some timer or something. There are more down this evening and they're dying off in small batches (see the last column, the duration down). 

t-w732-ix-217.wintest.releng.scl3.mozilla.com
  DOWN  02-16-2016 18:32:11   0d 0h 33m 21s
t-w732-ix-079.wintest.releng.scl3.mozilla.com
  DOWN  02-16-2016 18:28:21   0d 0h 52m 44s    
t-w732-ix-238.wintest.releng.scl3.mozilla.com
  DOWN  02-16-2016 18:28:21   0d 0h 52m 44s    
t-w732-ix-191.wintest.releng.scl3.mozilla.com
  DOWN  02-16-2016 18:30:11   0d 1h 17m 40s    
t-w732-ix-070.wintest.releng.scl3.mozilla.com
  DOWN  02-16-2016 18:29:31   0d 1h 18m 0s   
t-w732-ix-181.wintest.releng.scl3.mozilla.com 
  DOWN  02-16-2016 18:29:31   0d 1h 18m 0s   
t-w732-ix-133.wintest.releng.scl3.mozilla.com 
  DOWN  02-16-2016 18:30:01   0d 2h 16m 14s    
t-w732-ix-196.wintest.releng.scl3.mozilla.com 
  DOWN  02-16-2016 18:29:01   0d 2h 17m 14s    
t-w732-ix-117.wintest.releng.scl3.mozilla.com 
  DOWN  02-16-2016 18:28:51   0d 2h 17m 24s    
t-w732-ix-119.wintest.releng.scl3.mozilla.com 
  DOWN  02-16-2016 18:28:51   0d 2h 17m 24s    
t-w732-ix-166.wintest.releng.scl3.mozilla.com 
  DOWN  02-16-2016 18:28:51   0d 2h 17m 24s    
t-w732-ix-169.wintest.releng.scl3.mozilla.com 
  DOWN  02-16-2016 18:28:51   0d 2h 17m 24s    
t-w732-ix-110.wintest.releng.scl3.mozilla.com
  DOWN  02-16-2016 18:29:31   0d 4h 30m 31s
Two more:

t-w732-ix-008.wintest.releng.scl3.mozilla.com 
  DOWN 	02-16-2016 19:02:11 	0d 0h 21m 50s
t-w732-ix-078.wintest.releng.scl3.mozilla.com 
  DOWN 	02-16-2016 19:02:11 	0d 0h 21m 50s

I've rebooted them all via IPMI. 110 was down for a bad disk (being handled in another bug), and 119 didn't come back up.
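To make the "dying off in small batches" pattern easier to see, the pasted Nagios entries above can be bucketed by how long each host has been down. A minimal sketch, assuming the two-line copy/paste format used in the comments above:

    # Minimal sketch: bucket Nagios "DOWN" entries by downtime to spot batches.
    # Assumes the copy-pasted two-line format used in the comments above.
    import re
    from collections import Counter

    NAGIOS_TEXT = """\
    t-w732-ix-217.wintest.releng.scl3.mozilla.com
      DOWN  02-16-2016 18:32:11   0d 0h 33m 21s
    t-w732-ix-079.wintest.releng.scl3.mozilla.com
      DOWN  02-16-2016 18:28:21   0d 0h 52m 44s
    """

    def batches(text):
        # Round each downtime to the nearest 10 minutes and count hosts per bucket.
        counts = Counter()
        for days, hours, mins in re.findall(r"(\d+)d (\d+)h (\d+)m", text):
            total_min = int(days) * 1440 + int(hours) * 60 + int(mins)
            counts[round(total_min, -1)] += 1
        return counts

    if __name__ == "__main__":
        for bucket, n in sorted(batches(NAGIOS_TEXT).items()):
            print("~%d min down: %d host(s)" % (bucket, n))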
There were another 54 down this morning:

t-w732-ix-253
t-w732-ix-079
t-w732-ix-006
t-w732-ix-144
t-w732-ix-061
t-w732-ix-004
t-w732-ix-050
t-w732-ix-065
t-w732-ix-210
t-w732-ix-008
t-w732-ix-062
t-w732-ix-082
t-w732-ix-125
t-w732-ix-130
t-w732-ix-135
t-w732-ix-217
t-w732-ix-229
t-w732-ix-250
t-w732-ix-034
t-w732-ix-048
t-w732-ix-089
t-w732-ix-129
t-w732-ix-139
t-w732-ix-095
t-w732-ix-169
t-w732-ix-212
t-w732-ix-231
t-w732-ix-012
t-w732-ix-186
t-w732-ix-221
t-w732-ix-153
t-w732-ix-114
t-w732-ix-183
t-w732-ix-017
t-w732-ix-073
t-w732-ix-059
t-w732-ix-067
t-w732-ix-103
t-w732-ix-225
t-w732-ix-075
t-w732-ix-002
t-w732-ix-015
t-w732-ix-188
t-w732-ix-007
t-w732-ix-243
t-w732-ix-251
t-w732-ix-124
t-w732-ix-237
t-w732-ix-001
t-w732-ix-112
t-w732-ix-053
t-w732-ix-195
t-w732-ix-155
t-w732-ix-108
Which of these symptoms apply?
* Unresponsive to pings
* Not taking any more Buildbot jobs (i.e. up and running but not taking jobs)
* Cannot ssh
* Cannot RDP/VNC
* Completely off

It would be interesting to look at the logs of the last job the machine took and see if it is something caused by a job.
I've tried looking but it is hard to know without knowing when the machine was rebooted.

Is anything still running on the host (any lingering processes or prompts)?
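To answer the symptom checklist above across many slaves at once, something like the following could classify each host as pingable / SSH reachable / RDP reachable. A minimal sketch only: the hostname is illustrative, the ping flags are the Linux ones, and it assumes the slaves normally answer on TCP 22 and 3389.

    # Minimal sketch: classify a slave against the symptom checklist above
    # (pingable? SSH port open? RDP port open?). Hostname is illustrative.
    import socket
    import subprocess

    def port_open(host, port, timeout=3):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def probe(host):
        ping = subprocess.run(
            ["ping", "-c", "1", "-W", "3", host],  # Linux ping flags; use -n/-w on Windows
            stdout=subprocess.DEVNULL,
        ).returncode == 0
        return {
            "ping": ping,
            "ssh (22)": port_open(host, 22),
            "rdp (3389)": port_open(host, 3389),
        }

    if __name__ == "__main__":
        print(probe("t-w732-ix-092.wintest.releng.scl3.mozilla.com"))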
Is it possible we have a virus? I suspect there is a pattern we could find to help us determine whether this is a datacenter issue, an OS config issue, or something else like a virus or exceptions in our builds/tests.
Could it be that we're consuming too much power at certain points when there are a lot of jobs running at once?
Has there been any power-related work?

Do the same machines get into the same state after a reboot?
Does a re-image take away the issue?
Could we be running out of disk?
(In reply to Armen Zambrano [:armenzg] - Engineering productivity from comment #11)

The machines are unresponsive to ping (which means no jobs, no SSH, no RDP/VNC, no connections at all). Since the onboard graphics cards are turned off (otherwise jobs fail), there's no way to tell whether a machine is actually shut down or still running without a network connection.


(In reply to Joel Maher [:jmaher] - Engineering productivity from comment #12)

I think a virus is unlikely, but there's a slim possibility, of course. We are looking for the pattern which will help us determine the cause, yes. So far nothing obvious.


from irc:

philor: don't know if this is somehow coincidence, but if you look at what the win7 slaves were doing when they died, I think you'll find a massive preponderance were running e10s browser-chrome-7 on a trunk tree
philor: https://treeherder.mozilla.org/#/jobs?repo=fx-team&fromchange=b2d75ac5ba0f&group_state=expanded&filter-resultStatus=retry&filter-searchStr=windows%207

There were some e10s patches made to buildbot on the 12th, but they were broken, backed out, and reapplied. They were also for jsreftest, so that seems unlikely to be our culprit. I didn't see anything else in hg (under build) that looked like it might be applicable: https://hg.mozilla.org/build?sort=lastchange

I was also wondering if a buildbot restart might have picked up an issue that was checked in earlier.
Do we know if this started on the 12th? or earlier?

I see an increase since the 9th in bug 1247453 (Intermittent zombiecheck | child process XXXX still alive after shutdown), but it is not high enough to account for this.
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1247453&startday=2016-02-08&endday=2016-02-17&tree=all
Van, can you hook a monitor up to the following hosts to see what state they're in? We're wondering if they're still up but frozen (maybe we've run out of memory).

t-w732-ix-092.wintest.releng.scl3.mozilla.com 
t-w732-ix-236.wintest.releng.scl3.mozilla.com 
t-w732-ix-036.wintest.releng.scl3.mozilla.com 
t-w732-ix-040.wintest.releng.scl3.mozilla.com 

And can you please enable the onboard graphics for t-w732-ix-092.wintest.releng.scl3.mozilla.com and t-w732-ix-040.wintest.releng.scl3.mozilla.com when you're done?
Flags: needinfo?(vle)
t-w732-ix-092 - no video output, no activity when i plugged in a mouse/keyboard

t-w732-ix-236 - no video output, no activity when i plugged in a mouse/keyboard

t-w732-ix-036 - sitting at desktop

t-w732-ix-040 - no video output, no activity when i plugged in a mouse/keyboard
Flags: needinfo?(vle)
>t-w732-ix-092.wintest.releng.scl3.mozilla.com and t-w732-ix-040.wintest.releng.scl3.mozilla.com

Changed to onboard video.
"Fixed" by the backout in https://hg.mozilla.org/mozilla-central/rev/0629918a09ae - the why of bug 1232042 causing what I'll bet was a driver bluescreen remains to be determined, and there'll no doubt be calls to update the video driver, but as far as I'm concerned this bug is fixed.
Turned on SOL remote access for 040 and 092 so we can disable and enable those cards without a colo trip if we have more issues. I have disabled the onboard cards, verified the resolution, and the machines are about to be re-enabled in slavealloc.
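For the record, with SOL enabled the consoles for 040 and 092 can be reached remotely with ipmitool. A minimal sketch, under the same assumptions (hypothetical "<host>-mgmt" BMC naming, credentials in the environment) as the power-cycle sketch earlier:

    # Minimal sketch: attach to a slave's serial-over-LAN console via ipmitool.
    # Same assumptions as the power-cycle sketch above: "<host>-mgmt" BMC naming
    # and IPMI_USER/IPMI_PASS in the environment.
    import os
    import subprocess

    def sol_console(host):
        bmc = host + "-mgmt"  # hypothetical BMC naming convention
        subprocess.run([
            "ipmitool", "-I", "lanplus",
            "-H", bmc,
            "-U", os.environ["IPMI_USER"],
            "-P", os.environ["IPMI_PASS"],
            "sol", "activate",
        ], check=True)

    if __name__ == "__main__":
        sol_console("t-w732-ix-040")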
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED