Closed Bug 435052 Opened 17 years ago Closed 17 years ago

network hiccup/twisted failure causing talos outage

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: anodelman, Assigned: mrz)

Details

[Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.]

Seen on:

Firefox: qm-mini-ubuntu01, qm-pmac05, qm-pmac-trunk05, qm-pmac-trunk06, qm-mini-xp02, qm-mini-vista05, qm-mini-vista02
Mozilla2: qm-plinux-trunk03, qm-pxp-trunk01, qm-pxp-trunk03, qm-pmac-trunk03

Seems to have hit sometime after 8:58am this morning. The machines are currently recovering, so this bug is mostly about determining what caused the outage in the first place.
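For reference, ConnectionLost is Twisted's way of saying the TCP connection dropped without a clean shutdown. A minimal sketch (not the Talos harness itself), assuming a Twisted client pointed at a buildbot master, that timestamps exactly when and why the client side loses its connection; the hostname and port 9989 are placeholders:

    # Hypothetical diagnostic client: logs the time and reason every time the
    # connection to the master drops, and reconnects automatically.
    from twisted.internet import reactor, protocol, error
    from twisted.python import log
    import sys, time

    class ProbeProtocol(protocol.Protocol):
        def connectionMade(self):
            log.msg("connected at %s" % time.ctime())

        def connectionLost(self, reason):
            # ConnectionLost is the "non-clean fashion" failure quoted above
            if reason.check(error.ConnectionLost):
                log.msg("non-clean disconnect at %s: %s"
                        % (time.ctime(), reason.getErrorMessage()))
            else:
                log.msg("disconnected at %s: %s"
                        % (time.ctime(), reason.getErrorMessage()))

    class ProbeFactory(protocol.ReconnectingClientFactory):
        protocol = ProbeProtocol

    log.startLogging(sys.stdout)
    reactor.connectTCP("qm-buildbot01", 9989, ProbeFactory())  # host/port are assumptions
    reactor.run()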
second time we've seen this...
Assignee: nobody → mrz
Severity: normal → major
Component: Release Engineering: Talos → Server Operations
Also saw some burning builders today:

balsa-18branch - CVS conflicts in mozilla/intl
crazyhorse - cc1plus: internal compiler error: Segmentation fault on xpcom/glue/standalone/nsGREDirServiceProvider.cpp
production-pacifica-vm - ../../dist/lib/gkxtfbase_s.lib : fatal error LNK1136: invalid or corrupt file

tbnewref-win32-tbox was also hitting CVS conflicts earlier, which could be due to network glitches. That was about 6:20 PDT.
This set of machines is on separate switches (asx103-05* and asx103-06*). The build VMs are all on core1/core2. Alice, what did your boxes lose connectivity to?
They lost connectivity with the masters on qm-rhel02 and qm-buildbot01. This is what makes me think it's a network error, since it spans two different buildbot masters on two different machines. If it were a single master, it would be easier to believe that it was an error in buildbot.
Having these talos machines down would close the tree, if it wasn't already closed for FF3.0rc1. As we're having other build machines go red as well (see bug#435134), it feels like this should be a blocker.
Severity: major → blocker
There isn't a single log entry since this morning from any of the switches. In fact, the last log entry is from 16:00 for an interface on dm-nagios01.
QA Contact: release → server-ops
(In reply to comment #4)
> They lost connectivity with the masters on qm-rhel02 and qm-buildbot01.

The latter VM is on qm-vmware01. It should be moved over to the releng/build ESX servers. Can we do that to eliminate any issues with how that ESX host is built?
Yes, go ahead and move qm-buildbot01 if you want to exclude qm-vmware01 from the picture.
26GB VM, probably under an hour to move - you're okay with that, right?
alice|afk: that's the centre of the current moz-central vs. ff3 testing, so i'd prefer if we do a planned outage
[9:14pm] alice|afk: can we delay?
Meanwhile, any ideas why we lost contact with qm-rhel02? Could it be the same disk/array/host that is suspect in bug#435134?
qm-rhel02 is on netapp-d-fcal1 on bm-vmware03. The others in bug 435134 all had disk I/O issues, and as far as I know qm-rhel02 did not. I have no idea why you lost contact with it, or whether qm-rhel02 was the side that lost contact; knowing which end failed would be really helpful in narrowing things down. Quite possibly qm-rhel02 was fine and had good connectivity, and the other hosts simply couldn't reach it. Or qm-rhel02 was the one that went offline and the others were fine. Unfortunately I don't have anything that points to one or the other.
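To narrow down which end failed next time, a small probe like the following could be left running on a slave and on the master host at the same time; comparing the timestamps of the first failures shows which side actually dropped off the network. This is a hypothetical helper, not part of buildbot, and the hostname and port 9989 are assumptions.

    # Hypothetical reachability probe: attempts a TCP connect every INTERVAL
    # seconds and logs a timestamped ok/FAIL line.
    import socket, sys, time

    HOST, PORT, INTERVAL = "qm-rhel02", 9989, 10  # placeholders

    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        try:
            conn = socket.create_connection((HOST, PORT), timeout=5)
            conn.close()
            print("%s ok" % stamp)
        except socket.error as e:
            print("%s FAIL: %s" % (stamp, e))
        sys.stdout.flush()
        time.sleep(INTERVAL)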
I want to call this closed. The two boxes you lost access to were spread across netapp-d and netapp-c, and that was very likely the cause. If I'm wrong and this happens again, re-open!
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard