Closed
Bug 435052
Opened 17 years ago
Closed 17 years ago
network hiccup/twisted failure causing talos outage
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: anodelman, Assigned: mrz)
Details
[Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.]
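For context on the error above: Twisted reports ConnectionDone for a clean shutdown (FIN) and ConnectionLost for an abrupt one, such as a TCP reset. The sketch below is not Buildbot or Twisted code; it is a minimal stdlib-only illustration of the kind of "non-clean" close the traceback describes, using SO_LINGER to force an RST on close.

```python
import socket
import struct
import threading

def abortive_server(ready, port_box):
    """Accept one connection, then close it abruptly with an RST."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port_box.append(srv.getsockname()[1])
    ready.set()
    conn, _ = srv.accept()
    # SO_LINGER with a zero timeout makes close() send RST instead of
    # FIN: the peer sees an abrupt, "non-clean" connection loss.
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))
    conn.close()
    srv.close()

ready = threading.Event()
port_box = []
t = threading.Thread(target=abortive_server, args=(ready, port_box))
t.start()
ready.wait()

cli = socket.create_connection(("127.0.0.1", port_box[0]))
try:
    cli.recv(1)          # RST arrives: abrupt loss, not a clean EOF
    outcome = "clean"    # a clean close would return b'' here instead
except ConnectionResetError:
    outcome = "reset"
finally:
    cli.close()
    t.join()
print(outcome)
```

On the Buildbot side, a reset like this surfaces on the slave as the ConnectionLost failure quoted above, rather than the ConnectionDone it would log for an orderly master shutdown.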
Seen on:
Firefox
qm-mini-ubuntu01
qm-pmac05
qm-pmac-trunk05
qm-pmac-trunk06
qm-mini-xp02
qm-mini-vista05
qm-mini-vista02
Mozilla2
qm-plinux-trunk03
qm-pxp-trunk01
qm-pxp-trunk03
qm-pmac-trunk03
Seems to have hit sometime after 8:58am this morning.
The machines are currently recovering, so this bug is mainly to determine what caused the outage in the first place.
Comment 1•17 years ago
second time we've seen this...
Assignee: nobody → mrz
Severity: normal → major
Component: Release Engineering: Talos → Server Operations
Comment 2•17 years ago
Also saw some burning builders today:
balsa-18branch - CVS conflicts in mozilla/intl
crazyhorse - cc1plus: internal compiler error: Segmentation fault on xpcom/glue/standalone/nsGREDirServiceProvider.cpp
production-pacifica-vm - ../../dist/lib/gkxtfbase_s.lib : fatal error LNK1136: invalid or corrupt file
tbnewref-win32-tbox was also hitting CVS conflicts earlier, which could be due to network glitches. That was about 6:20 PDT.
Assignee
Comment 3•17 years ago
This set of machines is on separate switches (asx103-05* and asx103-06*).
The build VMs are all on core1/core2.
Alice, what did your boxes lose connectivity to?
Reporter
Comment 4•17 years ago
They lost connectivity with the masters on qm-rhel02 and qm-buildbot01.
This is what makes me suspect network errors, since the failure spans two different buildbot masters on two different machines. If it were a single master, it would be easier to believe it was an error in buildbot.
Comment 5•17 years ago
Having these talos machines down would close the tree, if it wasn't already closed for FF3.0rc1.
As we're also having other build machines go red (see bug#435134), it feels like this should be a blocker too.
Severity: major → blocker
Assignee
Comment 6•17 years ago
There isn't a single log entry since this morning from any of the switches. In fact, the last log entry is from 16:00 for an interface on dm-nagios01.
Updated•17 years ago
QA Contact: release → server-ops
Assignee
Comment 7•17 years ago
(In reply to comment #4)
> They lost connectivity with the masters on qm-rhel02 and qm-buildbot01.
The latter VM is on qm-vmware01. It should be moved over to the releng/build ESX servers. Can we do that, to eliminate any issues with how that ESX host is built?
Comment 8•17 years ago
Yes, go ahead and move qm-buildbot01 if you want to exclude qm-vmware01 from the picture.
Assignee
Comment 9•17 years ago
26GB VM, probably under an hour to move - you're okay with that, right?
Assignee
Comment 10•17 years ago
alice|afk: that's the centre of the current moz-central vs. ff3 testing, so i'd prefer if we do a planned outage
[9:14pm] alice|afk: can we delay?
Comment 11•17 years ago
Meanwhile, any ideas why we lost contact with qm-rhel02? Could it be the same disk/array/host as is suspect in bug#435134?
Assignee
Comment 12•17 years ago
qm-rhel02 is on netapp-d-fcal1 on bm-vmware03. The others in bug 435134 all had disk I/O issues and as far as I know, qm-rhel02 did not.
I have no idea why you lost contact with it, or whether qm-rhel02 itself lost contact. Knowing which end failed would be really helpful in narrowing things down. Quite possibly qm-rhel02 was fine and had good connectivity, and the other hosts couldn't reach it. Or qm-rhel02 was the one that went offline and the others were fine. Unfortunately, I don't have anything that points to one or the other.
Assignee
Comment 13•17 years ago
I want to call this closed. The two boxes you lost access to were spread across netapp-d and netapp-c, which was very likely the cause.
If I'm wrong and this happens again, re-open!
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard