Closed Bug 435052 Opened 17 years ago Closed 17 years ago

network hiccup/twisted failure causing talos outage

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: anodelman, Assigned: mrz)

Details

[Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.]

Seen on:

Firefox: qm-mini-ubuntu01, qm-pmac05, qm-pmac-trunk05, qm-pmac-trunk06, qm-mini-xp02, qm-mini-vista05, qm-mini-vista02
Mozilla2: qm-plinux-trunk03, qm-pxp-trunk01, qm-pxp-trunk03, qm-pmac-trunk03

Seems to have hit sometime after 8:58am this morning. The machines are currently recovering, so this bug is mostly about determining what caused the outage in the first place.
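For reference, ConnectionLost is Twisted's way of saying the TCP connection dropped without a clean shutdown. A minimal sketch (not the Talos harness itself), assuming a Twisted client pointed at a buildbot master, that timestamps exactly when and why the client side loses its connection; the hostname and port 9989 are placeholders:

    # Hypothetical diagnostic client: logs the time and reason every time the
    # connection to the master drops, and reconnects automatically.
    from twisted.internet import reactor, protocol, error
    from twisted.python import log
    import sys, time

    class ProbeProtocol(protocol.Protocol):
        def connectionMade(self):
            log.msg("connected at %s" % time.ctime())

        def connectionLost(self, reason):
            # ConnectionLost is the "non-clean fashion" failure quoted above
            if reason.check(error.ConnectionLost):
                log.msg("non-clean disconnect at %s: %s"
                        % (time.ctime(), reason.getErrorMessage()))
            else:
                log.msg("disconnected at %s: %s"
                        % (time.ctime(), reason.getErrorMessage()))

    class ProbeFactory(protocol.ReconnectingClientFactory):
        protocol = ProbeProtocol

    log.startLogging(sys.stdout)
    reactor.connectTCP("qm-buildbot01", 9989, ProbeFactory())  # host/port are assumptions
    reactor.run()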
second time we've seen this...
Assignee: nobody → mrz
Severity: normal → major
Component: Release Engineering: Talos → Server Operations
Also saw some burning builders today:

balsa-18branch - CVS conflicts in mozilla/intl
crazyhorse - cc1plus: internal compiler error: Segmentation fault on xpcom/glue/standalone/nsGREDirServiceProvider.cpp
production-pacifica-vm - ../../dist/lib/gkxtfbase_s.lib : fatal error LNK1136: invalid or corrupt file

tbnewref-win32-tbox was also hitting CVS conflicts earlier, which could be due to network glitches. That was about 6:20 PDT.
This set of machines is on separate switches (asx103-05* and asx103-06*). The build VMs are all on core1/core2. Alice, what did your boxes lose connectivity to?
They lost connectivity with the masters on qm-rhel02 and qm-buildbot01. This is what makes me think it's a network error, since it spans two different buildbot masters on two different machines. If it were a single master, it would be easier to believe that it was an error in buildbot.
Having these talos machines down would close the tree, if it wasn't already closed for FF3.0rc1. As we're having other build machines go red as well (see bug#435134), it feels like this should be a blocker.
Severity: major → blocker
There isn't a single log entry since this morning from any of the switches. In fact, the last log entry is from 16:00 for an interface on dm-nagios01.
QA Contact: release → server-ops
(In reply to comment #4)
> They lost connectivity with the masters on qm-rhel02 and qm-buildbot01.

The latter VM is on qm-vmware01. It should be moved over to the releng/build ESX servers. Can we do that to eliminate any issues with how that ESX host is built?
Yes, go ahead and move qm-buildbot01 if you want to exclude qm-vmware01 from the picture.
26GB VM, probably under an hour to move - you're okay with that, right?
alice|afk: that's the centre of the current moz-central vs. ff3 testing, so i'd prefer if we do a planned outage
[9:14pm] alice|afk: can we delay?
Meanwhile, any ideas why we lost contact with qm-rhel02? Could it be the same disk/array/host that is suspect in bug#435134?
qm-rhel02 is on netapp-d-fcal1 on bm-vmware03. The others in bug 435134 all had disk I/O issues, and as far as I know qm-rhel02 did not. I have no idea why you lost contact with it, or whether qm-rhel02 was the side that lost contact; knowing which end failed would be really helpful in narrowing things down. Quite possibly qm-rhel02 was fine and had good connectivity, and the other hosts simply couldn't reach it. Or qm-rhel02 was the one that went offline and the others were fine. Unfortunately I don't have anything that points to one or the other.
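To narrow down which end failed next time, a small probe like the following could be left running on a slave and on the master host at the same time; comparing the timestamps of the first failures shows which side actually dropped off the network. This is a hypothetical helper, not part of buildbot, and the hostname and port 9989 are assumptions.

    # Hypothetical reachability probe: attempts a TCP connect every INTERVAL
    # seconds and logs a timestamped ok/FAIL line.
    import socket, sys, time

    HOST, PORT, INTERVAL = "qm-rhel02", 9989, 10  # placeholders

    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        try:
            conn = socket.create_connection((HOST, PORT), timeout=5)
            conn.close()
            print("%s ok" % stamp)
        except socket.error as e:
            print("%s FAIL: %s" % (stamp, e))
        sys.stdout.flush()
        time.sleep(INTERVAL)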
I want to call this closed. The two boxes you lost access to were spread across netapp-d and netapp-c, and that was very likely the cause. If I'm wrong and this happens again, re-open!
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard