Closed Bug 636342 Opened 14 years ago Closed 14 years ago

MTV build/test slave outage

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
All
task
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ehsan.akhgari, Assigned: ravi)

References

()

Details

(Whiteboard: [slaveduty])

See the URL. I'm closing the tree until the issue is resolved.
Severity: normal → blocker
Taking after being pinged in IRC.
Assignee: nobody → joduinn
OS: Mac OS X → Windows Server 2003
Failure logs:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1298508382.1298509599.12859.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1298507507.1298508722.8920.gz

In both cases, the failure is:
> python: can't open file 'tools/buildfarm/maintenance/count_and_reboot.py': [Errno 2] No such file or directory
> program finished with exit code 2
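For context, a count-and-reboot style helper typically bumps a per-slave build counter and reboots the machine once a threshold is hit, so stale state doesn't accumulate between jobs. The sketch below is a hypothetical illustration of that idea, not the actual tools/buildfarm/maintenance/count_and_reboot.py; the counter filename, threshold, and reboot command are all assumptions.

# Hypothetical sketch of a count-and-reboot helper; the real
# tools/buildfarm/maintenance/count_and_reboot.py may differ.
import subprocess

COUNT_FILE = "reboot_count.txt"  # assumed location, relative to the build dir

def read_count(path):
    # Return the current build count, or 0 if no counter file exists yet.
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (IOError, ValueError):
        return 0

def main(max_builds=5):
    count = read_count(COUNT_FILE) + 1
    with open(COUNT_FILE, "w") as f:
        f.write(str(count))
    if count >= max_builds:
        print("Build count %d reached threshold %d, rebooting" % (count, max_builds))
        # Reboot command differs per platform; this is the Windows form.
        subprocess.call(["shutdown", "-r", "-t", "0"])
    else:
        print("Build count %d, below threshold %d" % (count, max_builds))

if __name__ == "__main__":
    main()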
1) These failing builds are visible in buildapi, but not displayed on TBPL (unknown why). However, they were found on the tinderbox page with help from dholbert.

2) Slaves mw32-ix-slave14 and mw32-ix-slave17 both failed out at approximately the same time, with the same error message:

....
adding changesets
adding manifests
transaction abort!
rollback completed
** unknown exception encountered, please report by visiting
** http://mercurial.selenic.com/wiki/BugTracker
** Python 2.6.6 (r266:84297, Aug 24 2010, 18:46:32) [MSC v.1500 32 bit (Intel)]
** Mercurial Distributed SCM (version 1.7.5)
** Extensions loaded: win32text, graphlog, share, purge
Traceback (most recent call last):
  File "hg", line 38, in <module>
  File "mercurial\dispatch.pyc", line 16, in run
  File "mercurial\dispatch.pyc", line 36, in dispatch
  File "mercurial\dispatch.pyc", line 58, in _runcatch
  File "mercurial\dispatch.pyc", line 593, in _dispatch
  File "mercurial\dispatch.pyc", line 401, in runcommand
  File "mercurial\dispatch.pyc", line 644, in _runcommand
  File "mercurial\dispatch.pyc", line 598, in checkargs
  File "mercurial\dispatch.pyc", line 591, in <lambda>
  File "mercurial\util.pyc", line 426, in check
  File "mercurial\commands.pyc", line 736, in clone
  File "mercurial\hg.pyc", line 337, in clone
  File "mercurial\localrepo.pyc", line 1886, in clone
  File "mercurial\localrepo.pyc", line 1295, in pull
  File "mercurial\localrepo.pyc", line 1711, in addchangegroup
  File "mercurial\revlog.pyc", line 1381, in addgroup
  File "mercurial\revlog.pyc", line 1220, in _addrevision
mpatch.mpatchError: patch cannot be decoded
program finished with exit code 255
elapsedTime=357.000000
=== Output ended ===
....

Sample logs are here:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1298508382.1298509599.12859.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1298507507.1298508722.8920.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1298507784.1298508018.6325.gz

Don't know why this is happening, and only to these two slaves, starting at approximately the same time. As far as I know, we didn't deploy a new version of Mercurial to the slaves today.

For now, I'm going to take these two slaves out of production and rekick the builds to see what happens.
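For context, mpatch.mpatchError during a clone means the delta data in the incoming changegroup couldn't be decoded, which points at the data arriving corrupted rather than at the slave's build scripts. A minimal way to separate the two is to re-run the clone into a scratch directory and verify the result. This is a hypothetical diagnostic sketch; the repo URL and scratch path are assumptions.

# Hypothetical diagnostic: re-clone into a scratch directory and verify it.
# Repo URL and scratch path are assumptions for illustration.
import subprocess
import sys

REPO = "http://hg.mozilla.org/mozilla-central"  # assumed source repo
SCRATCH = "scratch-clone"

def run(cmd):
    # Echo and run a command, returning its exit code.
    print(">>> " + " ".join(cmd))
    return subprocess.call(cmd)

if run(["hg", "clone", REPO, SCRATCH]) != 0:
    # Clone failed again: the data is bad in transit (or at the server),
    # not a problem with the slave's checkout scripts.
    sys.exit("clone failed - problem is upstream of the slave")

# Clone succeeded: verify the revlog integrity of what was received.
if run(["hg", "verify", "-R", SCRATCH]) != 0:
    sys.exit("clone completed but verification failed")

print("clone and verify both OK")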
I've moved aside the buildbot.tac files for mw32-ix-slave14, mw32-ix-slave17 and rebooted them. With them out of the way, we're rekicking those builds to see if it happens again on other slaves.
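For reference, taking a slave out of rotation this way works because the buildslave process can't start without its buildbot.tac, so after the reboot the machine simply never reconnects to the master. A minimal sketch of the idea follows; the base directory and the ".disabled" suffix are assumptions, not the exact paths used here.

# Minimal sketch: disable a buildslave by moving its buildbot.tac aside,
# then reboot so the running buildslave process goes away.
# Paths and suffix are assumptions, not the exact ones used here.
import os
import subprocess

SLAVE_BASEDIR = r"C:\slave"  # hypothetical buildslave base directory

def disable_slave(basedir):
    tac = os.path.join(basedir, "buildbot.tac")
    if os.path.exists(tac):
        os.rename(tac, tac + ".disabled")  # buildslave won't start without it
    # Reboot (Windows form); the slave stays idle until the .tac is restored.
    subprocess.call(["shutdown", "-r", "-t", "0"])

if __name__ == "__main__":
    disable_slave(SLAVE_BASEDIR)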
Whiteboard: [slaveduty]
(In reply to comment #4)
> I've moved aside the buildbot.tac files for mw32-ix-slave14, mw32-ix-slave17
> and rebooted them.
>
> With them out of the way, we're rekicking those builds to see if it happens
> again on other slaves.

5 failed jobs rekicked.
May or may not be related. In #build, I see a bunch of other machines in the same server room in 650castro fail to respond to ping.

18:18:38 < nagios> [41] mw32-ix-slave22.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:18:51 < nagios> [42] w32-ix-slave18.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:19:36 < nagios> [43] w32-ix-slave06.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:22:09 < nagios> [46] w32-ix-slave13.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:23:37 < nagios> [47] mw32-ix-slave11.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:30:21 < nagios> [48] mw32-ix-slave13.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:31:18 < nagios> [49] mw32-ix-slave15.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:33:44 < nagios> [50] moz2-win32-slave24.build:hung slave is CRITICAL: twistd.log last changed Saturday, February 19, 2011 02:17:34, : 1 critical
18:38:01 < nagios> [51] w32-ix-slave03.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:43:16 * joduinn looks up and wonders why we are suddenly losing a bunch of win32 slaves in 650castro?

Passing the baton to nthomas (with thanks), as he's still at his desk, and I'm now late for class.
Assignee: joduinn → nrthomas
We started off with failures to clone from hg.mozilla.org (while still maintaining a network connection to the buildbot master in MPT), and now mw32-ix-slave14 and mw32-ix-slave17 are inaccessible. So are several other Windows hosts, per nagios PING checks:

mw32-ix-slave01.build
mw32-ix-slave11.build
mw32-ix-slave13.build
mw32-ix-slave15.build
mw32-ix-slave19.build
mw32-ix-slave21.build
mw32-ix-slave22.build
mw32-ix-slave24.build
w32-ix-slave03.build
w32-ix-slave06.build
w32-ix-slave12.build
w32-ix-slave13.build
w32-ix-slave18.build

Those started failing from about 17:30 PST. Aki also reports Tegras and n900s down. Some hosts are still falling over slowly, while others remain responsive (e.g. mw32-ix-slave03; I can't find any Linux boxes down).
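A quick way to confirm which of those hosts are truly unreachable (rather than just flapping in nagios) is a one-shot ping sweep from another machine on the build network. A minimal sketch, assuming the .build hostnames resolve from wherever it runs; the host list below is just a sample from the comment above.

# Minimal ping sweep over the suspect slaves; hostnames are assumed to
# resolve from the machine running this.
import subprocess
import sys

HOSTS = [
    "mw32-ix-slave01.build", "mw32-ix-slave11.build", "mw32-ix-slave13.build",
    "mw32-ix-slave14.build", "mw32-ix-slave15.build", "mw32-ix-slave17.build",
    "w32-ix-slave03.build", "w32-ix-slave06.build", "w32-ix-slave18.build",
]

def is_up(host):
    # Send a single ping and report whether the host answered.
    flag = "-n" if sys.platform.startswith("win") else "-c"
    return subprocess.call(
        ["ping", flag, "1", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ) == 0

for host in HOSTS:
    print("%-25s %s" % (host, "up" if is_up(host) else "DOWN"))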
Assignee: nrthomas → server-ops
Severity: blocker → critical
Component: Release Engineering → Server Operations
QA Contact: release → mrz
The win32 boxes alone don't close the tree; we have enough capacity elsewhere to make do. Missing Tegras and n900s does close it if the outage lasts more than an hour or so. Aki is in the office and taking a quick look.
Summary: Windows builds are failing over and over on mozilla-central → Some RelEng assets in Mt View inaccessible
Unable to raise the n900s/tegras. -> blocker
Severity: critical → blocker
Summary: Some RelEng assets in Mt View inaccessible → MTV build/test slave outage
(In reply to comment #8)
> The win32 boxes alone don't close the tree; we have enough capacity
> elsewhere to make do. Missing Tegras and n900s does close it if the outage
> lasts more than an hour or so.

Agreed, tree closed because of lack of Tegra and n900 coverage.

> Aki is in the office and taking a quick look.

Correction: Aki is in SF, not in the office.
OS: Windows Server 2003 → All
Assignee: server-ops → shyam
It looks like machines that have been up stay up, but if they reboot they don't come back up. Possibly DHCP.
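If DHCP is the culprit, Windows slaves that reboot will fall back to an APIPA (169.254.0.0/16) address instead of getting a real lease. A minimal local check, assuming it is run on the slave itself; it only inspects ipconfig output and is a diagnostic sketch, not anything we run in production.

# Minimal check for a failed DHCP lease on a Windows slave: if the active
# adapter only has a 169.254.x.x (APIPA) address, DHCP never answered.
import subprocess

def has_apipa_address():
    # Return True if ipconfig shows an APIPA (169.254.x.x) IPv4 address.
    output = subprocess.check_output(["ipconfig"]).decode("ascii", "replace")
    for line in output.splitlines():
        if "169.254." in line and ("IPv4" in line or "IP Address" in line):
            return True
    return False

if __name__ == "__main__":
    if has_apipa_address():
        print("APIPA address present - DHCP lease was not obtained")
    else:
        print("No APIPA address found - DHCP looks OK")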
Machines are coming back up. Root cause looks like new firewall rules that blocked DHCP (and, I suppose, broke hg per comment 3?). Lowering priority. We need a postmortem, plus verification that everything is good.
Severity: blocker → major
From IRC with dmoore, zandr, aki, joduinn, ravi, and fox2mike: it looks like filter changes were made to the firewall in MV this evening around 17:17, which caused rebooting machines to not get new IPs. dmoore reverted the change for now; full postmortem tomorrow. Once builds report green again, we'll reopen the tree and can then reduce from blocker.
Severity: major → blocker
I'm confused about how the escalation went with this. The bug was opened at 2011-02-23 17:21:41 PST, which was very soon after I applied my change. I stayed in the office for over an hour to verify there were no issues as a result of it, and left when all appeared clear. Three hours later, dmoore happened to sign in, and shortly after I did as well.
Over to NetOps, assigning to dmoore since he fixed it.
Assignee: shyam → dmoore
Component: Server Operations → Server Operations: Netops
I added the filter that caused the DHCP issue, so this comes over to me to make sure it doesn't do that when I reapply it.
Assignee: dmoore → ravi
For the history books, the complete list of downed slaves was:

linux-ix-slave16
mw32-ix-slave01
mw32-ix-slave02
mw32-ix-slave10
mw32-ix-slave11
mw32-ix-slave13
mw32-ix-slave14
mw32-ix-slave15
mw32-ix-slave17
mw32-ix-slave19
mw32-ix-slave21
mw32-ix-slave22
mw32-ix-slave24
w32-ix-slave03
w32-ix-slave06
w32-ix-slave12
w32-ix-slave13
w32-ix-slave14
w32-ix-slave16
w32-ix-slave18
w32-ix-slave20
w32-ix-slave21
w32-ix-slave25

All but linux-ix-slave16 came back, which suggests DHCP is to blame. I've filed a new bug to fix this on CentOS 5 - bug 636390.
The mozilla-central tree has just been reopened. The mobile tree is also green, except for mochitest2 and mochitest3 (expected).
Component: Server Operations: Netops → Server Operations
(In reply to comment #14)
> I'm confused about how the escalation went with this. The bug was opened
> at 2011-02-23 17:21:41 PST, which was very soon after I applied my change.
> I stayed in the office for over an hour to verify there were no issues as
> a result of it, and left when all appeared clear. Three hours later, dmoore
> happened to sign in, and shortly after I did as well.

This first looked like hg errors, which could have been isolated to a couple of build slaves, so joduinn took those slaves out of the pool. Then all the win32 MV slaves went down, but we have coverage in scl1, so we didn't close the tree. However, downing all Tegras and n900s would close the tree, so we moved the bug over to Server Ops and raised the severity at that point. Those only went down after a reboot, so it took a while to get back up.

I think we need to talk about how to keep track of changes, so people can figure out what changed, and when, that could have broken things. What's IT's change management system? Bugzilla?
Severity: blocker → major
(In reply to comment #19)
> Those only went down after a reboot, so it took a while to get back up.

Er, "those were fine until they tried to reboot, and couldn't come back up" or something. Either way, it took time for the issue to show itself as a complete outage.
I think there's still an issue here. We're getting extremely slow network traffic between MV slaves and MPT machines (like hg.m.o, for example). It doesn't affect MPT- or SCL-based slaves. We've shut down these machines for now, so this isn't a blocker yet, but it's still important. If the mobile machines are hosed too, this becomes a blocker.
Severity: major → critical
I've filed bug 636462 to track releng details about this morning's trouble. We're assuming for the moment that this is related to the filter changes from last night.
I think you guys are piggybacking this new issue onto yesterday's issue. While I understand your sensitivity to things after the last week, please don't conflate the two, as it may negatively impact the troubleshooting process. All of these values are within their normal daily averages in MTV1: firewall CPU, bandwidth to SJC1, and latency from a host in MTV1 to SJC1.
Apologies if this is unrelated, I wasn't sure based on the earlier comments if this issue was 100% fixed. I certainly don't mean to confuse matters.
(In reply to comment #4)
> I've moved aside the buildbot.tac files for mw32-ix-slave14, mw32-ix-slave17
> and rebooted them.
>
> With them out of the way, we're rekicking those builds to see if it happens
> again on other slaves.

Spinning off bug 636475 to get these innocent slaves back into production.
This bug is confusing but can largely be summarized as DHCP issues in MTV. Bug 636462 is tracking hg issues not unlike comment #3, so I'm calling this closed.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard