MTV build/test slave outage

Status: RESOLVED FIXED
Priority: --
Severity: critical
Reported: 8 years ago
Last modified: 4 years ago

People: (Reporter: Ehsan, Assigned: ravi)

Details: (Whiteboard: [slaveduty], URL)

(Reporter)

Description

8 years ago
See the URL.  I'm closing the tree until the issue is resolved.
(Reporter)

Updated

8 years ago
Severity: normal → blocker
Taking this after being pinged in IRC.
Assignee: nobody → joduinn
OS: Mac OS X → Windows Server 2003
failure logs:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1298508382.1298509599.12859.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1298507507.1298508722.8920.gz

In both cases, the failure is:
> python: can't open file 'tools/buildfarm/maintenance/count_and_reboot.py': [Errno 2] No such file or directory
> program finished with exit code 2
1) These failing builds are visible in buildapi but not displayed on tbpl (unknown why); however, they were found on the tinderbox page with help from dholbert.

2) Slaves: mw32-ix-slave14 and mw32-ix-slave17 both failed out at approximately the same time, with the same error message:

....
adding changesets
adding manifests
transaction abort!
rollback completed
** unknown exception encountered, please report by visiting
**  http://mercurial.selenic.com/wiki/BugTracker
** Python 2.6.6 (r266:84297, Aug 24 2010, 18:46:32) [MSC v.1500 32 bit (Intel)]
** Mercurial Distributed SCM (version 1.7.5)
** Extensions loaded: win32text, graphlog, share, purge
Traceback (most recent call last):
  File "hg", line 38, in <module>
  File "mercurial\dispatch.pyc", line 16, in run
  File "mercurial\dispatch.pyc", line 36, in dispatch
  File "mercurial\dispatch.pyc", line 58, in _runcatch
  File "mercurial\dispatch.pyc", line 593, in _dispatch
  File "mercurial\dispatch.pyc", line 401, in runcommand
  File "mercurial\dispatch.pyc", line 644, in _runcommand
  File "mercurial\dispatch.pyc", line 598, in checkargs
  File "mercurial\dispatch.pyc", line 591, in <lambda>
  File "mercurial\util.pyc", line 426, in check
  File "mercurial\commands.pyc", line 736, in clone
  File "mercurial\hg.pyc", line 337, in clone
  File "mercurial\localrepo.pyc", line 1886, in clone
  File "mercurial\localrepo.pyc", line 1295, in pull
  File "mercurial\localrepo.pyc", line 1711, in addchangegroup
  File "mercurial\revlog.pyc", line 1381, in addgroup
  File "mercurial\revlog.pyc", line 1220, in _addrevision
mpatch.mpatchError: patch cannot be decoded
program finished with exit code 255
elapsedTime=357.000000
=== Output ended ===
....

Sample logs are here: 
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1298508382.1298509599.12859.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1298507507.1298508722.8920.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1298507784.1298508018.6325.gz

Don't know why this is happening, and only to these two slaves, starting at approximately the same time. As far as I know, we didn't deploy a new version of Mercurial to the slaves today.

For now, I'm going to take these two slaves out of production and rekick the builds to see what happens.
I've moved aside the buildbot.tac files for mw32-ix-slave14, mw32-ix-slave17 and rebooted them. 

With them out of the way, we're rekicking those builds to see if it happens again on other slaves.
Whiteboard: [slaveduty]
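For reference, moving buildbot.tac aside is what keeps a slave out of the pool here: without that file the buildslave has nothing to start from after the reboot, so it never reconnects to its master. A minimal sketch of that disable step, assuming a typical Windows buildslave basedir (the path below is an assumption, not taken from this bug):

....
# Hypothetical sketch of the manual "disable a buildslave" step described above.
# Renaming buildbot.tac means the buildslave has no config to start from after
# the reboot, so it stays out of the pool. SLAVE_BASEDIR is an assumption.
import os

SLAVE_BASEDIR = r"C:\slave"                       # assumed basedir, not from this bug
tac = os.path.join(SLAVE_BASEDIR, "buildbot.tac")

if os.path.exists(tac):
    os.rename(tac, tac + ".disabled")             # move aside rather than delete
    print("buildbot.tac moved aside; slave will not rejoin the pool after reboot")
else:
    print("no buildbot.tac found; slave appears to be disabled already")
....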
(In reply to comment #4)
> I've moved aside the buildbot.tac files for mw32-ix-slave14, mw32-ix-slave17
> and rebooted them. 
> 
> With them out of the way, we're rekicking those builds to see if it happens
> again on other slaves.

5 failed jobs rekicked.
May or may not be related. In #build, I see a bunch of other machines in the same server room in 650castro fail to respond to ping.

18:18:38 < nagios> [41] mw32-ix-slave22.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:18:51 < nagios> [42] w32-ix-slave18.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:19:36 < nagios> [43] w32-ix-slave06.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:22:09 < nagios> [46] w32-ix-slave13.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:23:37 < nagios> [47] mw32-ix-slave11.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:30:21 < nagios> [48] mw32-ix-slave13.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:31:18 < nagios> [49] mw32-ix-slave15.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
18:33:44 < nagios> [50] moz2-win32-slave24.build:hung slave is CRITICAL: twistd.log last changed Saturday, February 19, 2011 02:17:34, : 1  critical
18:38:01 < nagios> [51] w32-ix-slave03.build:PING is CRITICAL: PING CRITICAL - Packet loss  = 100%
18:43:16  * joduinn looks up and wonders why we are suddenly losing a bunch of win32 slaves in 650castro? 


Passing the baton to nthomas (with thanks), as he's still at his desk, and I'm now late for class.
Assignee: joduinn → nrthomas
We started off with failures to clone from hg.mozilla.org (while still maintaining a network connection to the buildbot master in MPT) and now mw32-ix-slave14 and mw32-ix-slave17 are inaccessible. So are several other Windows hosts, from nagios PING checks:
mw32-ix-slave01.build
mw32-ix-slave11.build
mw32-ix-slave13.build
mw32-ix-slave15.build
mw32-ix-slave19.build
mw32-ix-slave21.build
mw32-ix-slave22.build
mw32-ix-slave24.build
w32-ix-slave03.build
w32-ix-slave06.build
w32-ix-slave12.build
w32-ix-slave13.build
w32-ix-slave18.build

Those started failing from about 1730 PST. Aki also reports Tegras and n900s down. Some are still falling over slowly, while others are still responsive (e.g. mw32-ix-slave03; can't find any Linux boxes down).
Assignee: nrthomas → server-ops
Severity: blocker → critical
Component: Release Engineering → Server Operations
QA Contact: release → mrz
The win32 boxes don't close the tree; we have enough capacity elsewhere to make do. Missing Tegras and n900s does if it's more than an hour or so. Aki is in the office and taking a quick look.
Summary: Windows builds are failing over and over on mozilla-central → Some RelEng assets in Mt View inaccessible
Unable to raise the n900s/tegras. -> blocker
Severity: critical → blocker
Summary: Some RelEng assets in Mt View inaccessible → MTV build/test slave outage
(In reply to comment #8)
> The win32 boxes don't close the tree; we have enough capacity elsewhere to make
> do. Missing Tegras and n900s does if it's more than an hour or so. 

Agreed, tree closed because of lack of tegra and n900 coverage. 

> Aki is in the office and taking a quick look.
Correction. Aki is in SF, not in the office.
OS: Windows Server 2003 → All

Updated

8 years ago
Assignee: server-ops → shyam
It looks like machines that have been up stay up, but if they reboot they don't come back up.  Possibly DHCP.
Machines are coming back up.
Root cause looks like new firewall rules that blocked DHCP (and, I suppose, broke hg per comment 3?).

Lowering priority. We need a postmortem and to verify that everything is good.
Severity: blocker → major
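To make the DHCP theory concrete: until a rebooted host has a lease it has no IP address at all, so its lease request goes out as a UDP broadcast from 0.0.0.0:68 to 255.255.255.255:67, and a filter that drops that broadcast exchange leaves the machine address-less, which matches the 100% packet loss in the nagios PING checks. A minimal probe sketch (not from this bug; it assumes Python on the affected host and privileges to bind port 68, which on Windows may also require stopping the DHCP Client service):

....
# Hypothetical DHCP reachability probe: broadcast a DHCPDISCOVER and see
# whether any server answers. If the firewall filter drops UDP 67/68
# broadcast traffic, this times out and the host never gets an address.
import os
import socket

xid = os.urandom(4)                      # transaction id
mac = b"\x00\x11\x22\x33\x44\x55"        # placeholder MAC, not a real slave's

discover = (
    b"\x01\x01\x06\x00"                  # op=BOOTREQUEST, htype=Ethernet, hlen=6, hops=0
    + xid
    + b"\x00\x00"                        # secs
    + b"\x80\x00"                        # flags: ask the server to broadcast its reply
    + b"\x00" * 16                       # ciaddr / yiaddr / siaddr / giaddr
    + mac + b"\x00" * 10                 # chaddr, padded to 16 bytes
    + b"\x00" * 192                      # sname + file
    + b"\x63\x82\x53\x63"                # DHCP magic cookie
    + b"\x35\x01\x01"                    # option 53: DHCPDISCOVER
    + b"\xff"                            # end of options
)

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
s.bind(("0.0.0.0", 68))                  # DHCP client port; needs admin/root
s.settimeout(5)
s.sendto(discover, ("255.255.255.255", 67))
try:
    data, server = s.recvfrom(1024)
    print("DHCP response from", server)
except socket.timeout:
    print("no DHCP response; broadcast UDP 67/68 may be filtered")
....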
From IRC with dmoore, zandr, aki, joduinn, ravi, fox2mike:

Looks like filter changes were made to the firewall in MV this evening around 17:17, which caused rebooting machines to not get new IPs. dmoore reverted for now. Full postmortem tomorrow. Once builds report green again, we'll reopen the tree and can then reduce from blocker.
Severity: major → blocker
(Assignee)

Comment 14

8 years ago
I'm confused about how the escalation went with this. The bug was opened at 2011-02-23 17:21:41 PST, which was very soon after I applied my change. I stayed in the office for over an hour to verify there were no issues as a result of it and left when all appeared clear. Three hours later, on a fluke, dmoore signed in, and shortly after I did as well.
Over to NetOps, assigning to dmoore since he fixed it.
Assignee: shyam → dmoore
Component: Server Operations → Server Operations: Netops
(Assignee)

Comment 16

8 years ago
I added the filter that caused the DHCP issue, so over to me to make sure it doesn't do that when I reapply it.
Assignee: dmoore → ravi
For the history books, the complete list of downed slaves was:

linux-ix-slave16
mw32-ix-slave01
mw32-ix-slave02
mw32-ix-slave10
mw32-ix-slave11
mw32-ix-slave13
mw32-ix-slave14
mw32-ix-slave15
mw32-ix-slave17
mw32-ix-slave19
mw32-ix-slave21
mw32-ix-slave22
mw32-ix-slave24
w32-ix-slave03
w32-ix-slave06
w32-ix-slave12
w32-ix-slave13
w32-ix-slave14
w32-ix-slave16
w32-ix-slave18
w32-ix-slave20
w32-ix-slave21
w32-ix-slave25

and all but linux-ix-slave16 came back, which suggests DHCP is to blame. I've filed a new bug to fix this on CentOS 5 - bug 636390.
mozilla-central tree has just been opened. Mobile tree is also green except mochitest2 and mochitest3 (expected).
Component: Server Operations: Netops → Server Operations
(In reply to comment #14)
> I'm confused how the escalation went with this.  The bug was opened at
> 2011-02-23 17:21:41 PST which was very soon after I applied my change.  I
> stayed in the office for over an hour to verify there were no issues as a
> result of this and left when all appeared clear.  3 hours later on a fluke
> dmoore signed in and shortly after I did as well.

This first looked like hg errors, which could have been isolated to a couple of build slaves, so joduinn took those slaves out of the pool. Then all the win32 MV slaves went down, but we have coverage in scl1, so we didn't close the tree.

However, downing all tegras+n900s would close the tree, so we moved the bug over to serverops + raised severity at that point.

Those only went down after a reboot, so took a while to get back up.

I think we need to be talking about how to keep track of changes so people can figure out what changed, and when, that could have broken things. What's IT's change management system? Bugzilla?
Severity: blocker → major
(In reply to comment #19)
> Those only went down after a reboot, so took a while to get back up.

Er, "those were fine until they tried to reboot, and couldn't come back up" or something.  Either way, it took time for the issue to show itself as a complete outage.
I think there's still an issue here. We're getting extremely slow network traffic between MV slaves and MPT machines (hg.m.o, for example). It doesn't affect MPT- or SCL-based slaves. We've shut down these machines for now, so this isn't a blocker yet, but it's still important.

If the mobile machines are hosed too, this becomes a blocker.
Severity: major → critical
I've filed bug 636462 to track releng details about this morning's trouble.  We're assuming for the moment that this is related to the filter changes from last night.
(Assignee)

Comment 23

8 years ago
I think you're lumping this new issue in with yesterday's issue. While I understand your sensitivity to things in the last week, please don't conflate the two, as it may negatively impact the troubleshooting process.

All of these values are within their normal daily averages in MTV1:

firewall CPU
bandwidth to SJC1
latency from a host in MTV1 to SJC1.
Apologies if this is unrelated; I wasn't sure based on the earlier comments whether this issue was 100% fixed. I certainly don't mean to confuse matters.
(In reply to comment #4)
> I've moved aside the buildbot.tac files for mw32-ix-slave14, mw32-ix-slave17
> and rebooted them. 
> 
> With them out of the way, we're rekicking those builds to see if it happens
> again on other slaves.

Spinning off bug#636475 to get these innocent slaves back into production.
This bug is confusing but can largely be summarized as DHCP issues in MTV. Bug 636462 is tracking hg issues not unlike comment #3; calling this closed.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard