MTV build/test slave outage

Status: RESOLVED FIXED
Priority: --
Severity: critical
Reported: 8 years ago
Last modified: 4 years ago

People: (Reporter: Ehsan, Assigned: ravi)

Details: (Whiteboard: [slaveduty], URL)

(Reporter)

Description

8 years ago
See the URL.  I'm closing the tree until the issue is resolved.
(Reporter)

Updated

8 years ago
Severity: normal → blocker
Taking this after being pinged in IRC.
Assignee: nobody → joduinn
OS: Mac OS X → Windows Server 2003
failure logs:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1298508382.1298509599.12859.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1298507507.1298508722.8920.gz

In both cases, the failure is:
> python: can't open file 'tools/buildfarm/maintenance/count_and_reboot.py': [Errno 2] No such file or directory
> program finished with exit code 2
1) These failing builds are visible in buildapi but not displayed on tbpl (unknown why); however, they were found on the tinderbox page with help from dholbert.

2) Slaves: mw32-ix-slave14 and mw32-ix-slave17 both failed out at approximately the same time, with the same error message:

....
adding changesets
adding manifests
transaction abort!
rollback completed
** unknown exception encountered, please report by visiting
**  http://mercurial.selenic.com/wiki/BugTracker
** Python 2.6.6 (r266:84297, Aug 24 2010, 18:46:32) [MSC v.1500 32 bit (Intel)]
** Mercurial Distributed SCM (version 1.7.5)
** Extensions loaded: win32text, graphlog, share, purge
Traceback (most recent call last):
  File "hg", line 38, in <module>
  File "mercurial\dispatch.pyc", line 16, in run
  File "mercurial\dispatch.pyc", line 36, in dispatch
  File "mercurial\dispatch.pyc", line 58, in _runcatch
  File "mercurial\dispatch.pyc", line 593, in _dispatch
  File "mercurial\dispatch.pyc", line 401, in runcommand
  File "mercurial\dispatch.pyc", line 644, in _runcommand
  File "mercurial\dispatch.pyc", line 598, in checkargs
  File "mercurial\dispatch.pyc", line 591, in <lambda>
  File "mercurial\util.pyc", line 426, in check
  File "mercurial\commands.pyc", line 736, in clone
  File "mercurial\hg.pyc", line 337, in clone
  File "mercurial\localrepo.pyc", line 1886, in clone
  File "mercurial\localrepo.pyc", line 1295, in pull
  File "mercurial\localrepo.pyc", line 1711, in addchangegroup
  File "mercurial\revlog.pyc", line 1381, in addgroup
  File "mercurial\revlog.pyc", line 1220, in _addrevision
mpatch.mpatchError: patch cannot be decoded
program finished with exit code 255
elapsedTime=357.000000
=== Output ended ===
....

Sample logs are here: 
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1298508382.1298509599.12859.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1298507507.1298508722.8920.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1298507784.1298508018.6325.gz

Don't know why this is happening, and only to these two slaves, starting at approximately the same time. As far as I know, we didn't deploy a new version of Mercurial to the slaves today.

For now, I'm going to take these two slaves out of production and rekick the builds to see what happens.
I've moved aside the buildbot.tac files for mw32-ix-slave14, mw32-ix-slave17 and rebooted them. 

With them out of the way, we're rekicking those builds to see if it happens again on other slaves.
Whiteboard: [slaveduty]
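For reference, moving buildbot.tac aside is what keeps a slave out of the pool here: without that file the buildslave has nothing to start from after the reboot, so it never reconnects to its master. A minimal sketch of that disable step, assuming a typical Windows buildslave basedir (the path below is an assumption, not taken from this bug):

....
# Hypothetical sketch of the manual "disable a buildslave" step described above.
# Renaming buildbot.tac means the buildslave has no config to start from after
# the reboot, so it stays out of the pool. SLAVE_BASEDIR is an assumption.
import os

SLAVE_BASEDIR = r"C:\slave"                       # assumed basedir, not from this bug
tac = os.path.join(SLAVE_BASEDIR, "buildbot.tac")

if os.path.exists(tac):
    os.rename(tac, tac + ".disabled")             # move aside rather than delete
    print("buildbot.tac moved aside; slave will not rejoin the pool after reboot")
else:
    print("no buildbot.tac found; slave appears to be disabled already")
....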
(In reply to comment #4)
> I've moved aside the buildbot.tac files for mw32-ix-slave14, mw32-ix-slave17
> and rebooted them. 
> 
> With them out of the way, we're rekicking those builds to see if it happens
> again on other slaves.

5 failed jobs rekicked.
May or may not be related. In #build, I see a bunch of other machines in the same server room in 650castro fail to respond to ping.

18:18:38 < nagios> [41] mw32-ix-slave22.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:18:51 < nagios> [42] w32-ix-slave18.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:19:36 < nagios> [43] w32-ix-slave06.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:22:09 < nagios> [46] w32-ix-slave13.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:23:37 < nagios> [47] mw32-ix-slave11.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:30:21 < nagios> [48] mw32-ix-slave13.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:31:18 < nagios> [49] mw32-ix-slave15.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:33:44 < nagios> [50] moz2-win32-slave24.build:hung slave is CRITICAL: twistd.log last changed Saturday, February 19, 2011 02:17:34, : 1  critical
18:38:01 < nagios> [51] w32-ix-slave03.build:PING is CRITICAL: PING CRITICAL - Packet loss  = 100%
18:43:16  * joduinn looks up and wonders why we are suddenly losing a bunch of win32 slaves in 650castro? 


Passing the baton to nthomas (with thanks), as he's still at his desk, and I'm now late for class.
Assignee: joduinn → nrthomas
We started off with failures to clone from hg.mozilla.org (while still maintaining a network connection to the buildbot master in MPT) and now mw32-ix-slave14 and mw32-ix-slave17 are inaccessible. So are several other Windows hosts, from nagios PING checks:
mw32-ix-slave01.build
mw32-ix-slave11.build
mw32-ix-slave13.build
mw32-ix-slave15.build
mw32-ix-slave19.build
mw32-ix-slave21.build
mw32-ix-slave22.build
mw32-ix-slave24.build
w32-ix-slave03.build
w32-ix-slave06.build
w32-ix-slave12.build
w32-ix-slave13.build
w32-ix-slave18.build

Those started failing from about 1730 PST. Aki also reports Tegras and n900s down. Some are still falling over slowly, while others are still responsive (e.g. mw32-ix-slave03; can't find any Linux boxes down).
Assignee: nrthomas → server-ops
Severity: blocker → critical
Component: Release Engineering → Server Operations
QA Contact: release → mrz
The win32 boxes don't close the tree; we have enough capacity elsewhere to make do. Missing Tegras and n900s does if it's more than an hour or so. Aki is in the office and taking a quick look.
Summary: Windows builds are failing over and over on mozilla-central → Some RelEng assets in Mt View inaccessible
Unable to raise the n900s/tegras. -> blocker
Severity: critical → blocker
Summary: Some RelEng assets in Mt View inaccessible → MTV build/test slave outage
(In reply to comment #8)
> The win32 boxes don't close the tree; we have enough capacity elsewhere to make
> do. Missing Tegras and n900s does if it's more than an hour or so. 

Agreed, tree closed because of lack of tegra and n900 coverage. 

> Aki is in the office and taking a quick look.
Correction. Aki is in SF, not in the office.
OS: Windows Server 2003 → All

Updated

8 years ago
Assignee: server-ops → shyam
It looks like machines that have been up stay up, but if they reboot they don't come back up.  Possibly DHCP.
Machines are coming back up.
Root cause looks like new firewall rules that blocked DHCP (and, I suppose, broke hg per comment 3?).

Lowering priority. We need a postmortem and to verify that everything is good.
Severity: blocker → major
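To make the DHCP theory concrete: until a rebooted host has a lease it has no IP address at all, so its lease request goes out as a UDP broadcast from 0.0.0.0:68 to 255.255.255.255:67, and a filter that drops that broadcast exchange leaves the machine address-less, which matches the 100% packet loss in the nagios PING checks. A minimal probe sketch (not from this bug; it assumes Python on the affected host and privileges to bind port 68, which on Windows may also require stopping the DHCP Client service):

....
# Hypothetical DHCP reachability probe: broadcast a DHCPDISCOVER and see
# whether any server answers. If the firewall filter drops UDP 67/68
# broadcast traffic, this times out and the host never gets an address.
import os
import socket

xid = os.urandom(4)                      # transaction id
mac = b"\x00\x11\x22\x33\x44\x55"        # placeholder MAC, not a real slave's

discover = (
    b"\x01\x01\x06\x00"                  # op=BOOTREQUEST, htype=Ethernet, hlen=6, hops=0
    + xid
    + b"\x00\x00"                        # secs
    + b"\x80\x00"                        # flags: ask the server to broadcast its reply
    + b"\x00" * 16                       # ciaddr / yiaddr / siaddr / giaddr
    + mac + b"\x00" * 10                 # chaddr, padded to 16 bytes
    + b"\x00" * 192                      # sname + file
    + b"\x63\x82\x53\x63"                # DHCP magic cookie
    + b"\x35\x01\x01"                    # option 53: DHCPDISCOVER
    + b"\xff"                            # end of options
)

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
s.bind(("0.0.0.0", 68))                  # DHCP client port; needs admin/root
s.settimeout(5)
s.sendto(discover, ("255.255.255.255", 67))
try:
    data, server = s.recvfrom(1024)
    print("DHCP response from", server)
except socket.timeout:
    print("no DHCP response; broadcast UDP 67/68 may be filtered")
....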
From IRC with dmoore, zandr, aki, joduinn, ravi, fox2mike:

Looks like filter changes were made to the firewall in MV this evening around 17:17, which caused rebooting machines to not get new IPs. dmoore reverted for now. Full postmortem tomorrow. Once builds report green again, we'll reopen the tree and can then reduce from blocker.
Severity: major → blocker
(Assignee)

Comment 14

8 years ago
I'm confused about how the escalation went with this. The bug was opened at 2011-02-23 17:21:41 PST, which was very soon after I applied my change. I stayed in the office for over an hour to verify there were no issues as a result of it and left when all appeared clear. Three hours later, on a fluke, dmoore signed in, and shortly after I did as well.
Over to NetOps, assigning to dmoore since he fixed it.
Assignee: shyam → dmoore
Component: Server Operations → Server Operations: Netops
(Assignee)

Comment 16

8 years ago
I added the filter that caused the DHCP issue, so over to me to make sure it doesn't do that when I reapply it.
Assignee: dmoore → ravi
For the history books, the complete list of downed slaves was:

linux-ix-slave16
mw32-ix-slave01
mw32-ix-slave02
mw32-ix-slave10
mw32-ix-slave11
mw32-ix-slave13
mw32-ix-slave14
mw32-ix-slave15
mw32-ix-slave17
mw32-ix-slave19
mw32-ix-slave21
mw32-ix-slave22
mw32-ix-slave24
w32-ix-slave03
w32-ix-slave06
w32-ix-slave12
w32-ix-slave13
w32-ix-slave14
w32-ix-slave16
w32-ix-slave18
w32-ix-slave20
w32-ix-slave21
w32-ix-slave25

and all but linux-ix-slave16 came back, which suggests DHCP is to blame. I've filed a new bug to fix this on CentOS 5 - bug 636390.
mozilla-central tree has just been opened. Mobile tree is also green except mochitest2 and mochitest3 (expected).
Component: Server Operations: Netops → Server Operations
(In reply to comment #14)
> I'm confused how the escalation went with this.  The bug was opened at
> 2011-02-23 17:21:41 PST which was very soon after I applied my change.  I
> stayed in the office for over an hour to verify there were no issues as a
> result of this and left when all appeared clear.  3 hours later on a fluke
> dmoore signed in and shortly after I did as well.

This first looked like hg errors, which could have been isolated to a couple of build slaves, so joduinn took those slaves out of the pool. Then all the win32 MV slaves went down, but we have coverage in scl1, so we didn't close the tree.

However, downing all tegras+n900s would close the tree, so we moved the bug over to serverops + raised severity at that point.

Those only went down after a reboot, so took a while to get back up.

I think we need to be talking about how to keep track of changes so people can figure out what changed, and when, that could have broken things. What's IT's change management system? Bugzilla?
Severity: blocker → major
(In reply to comment #19)
> Those only went down after a reboot, so took a while to get back up.

Er, "those were fine until they tried to reboot, and couldn't come back up" or something.  Either way, it took time for the issue to show itself as a complete outage.
I think there's still an issue here. We're getting extremely slow network traffic between MV slaves and MPT machines (hg.m.o, for example). It doesn't affect MPT- or SCL-based slaves. We've shut down these machines for now, so this isn't a blocker yet, but it's still important.

If the mobile machines are hosed too, this becomes a blocker.
Severity: major → critical
I've filed bug 636462 to track releng details about this morning's trouble.  We're assuming for the moment that this is related to the filter changes from last night.
(Assignee)

Comment 23

8 years ago
I think you're lumping this new issue in with yesterday's issue. While I understand your sensitivity to things in the last week, please don't conflate the two, as it may negatively impact the troubleshooting process.

All of these values are within their normal daily averages in MTV1:

firewall CPU
bandwidth to SJC1
latency from a host in MTV1 to SJC1.
Apologies if this is unrelated; I wasn't sure based on the earlier comments whether this issue was 100% fixed. I certainly don't mean to confuse matters.
(In reply to comment #4)
> I've moved aside the buildbot.tac files for mw32-ix-slave14, mw32-ix-slave17
> and rebooted them. 
> 
> With them out of the way, we're rekicking those builds to see if it happens
> again on other slaves.

Spinning off bug#636475 to get these innocent slaves back into production.
This bug is confusing but can largely be summarized as DHCP issues in MTV. Bug 636462 is tracking hg issues not unlike comment #3; calling this closed.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard