605064 - Network in Castro failing intermittently

Reporter

Description

•

15 years ago

E.g. at ~17:14 this evening:

[13] mv-buildproxy01.build is DOWN: PING CRITICAL - Packet loss = 100%
[14] talos-staging-master02 is DOWN: PING CRITICAL - Packet loss = 100%
[15] test-master01.build is DOWN: PING CRITICAL - Packet loss = 100%
[16] talos-master02.build is DOWN: PING CRITICAL - Packet loss = 100%
[17] geriatric-master.build is DOWN: PING CRITICAL - Packet loss = 100%
[18] staging-mobile-master.build is DOWN: PING CRITICAL - Packet loss = 100%
[20] test-master02.build is DOWN: PING CRITICAL - Packet loss = 100%
[21] production-mobile-master.build is DOWN: PING CRITICAL - Packet loss = 100%

followed by recovery messages. Also at ~1617 and ~1700.

From a RelEng point of view this is causing slaves in Castro to lose their connection to their buildbot master (they maintain a persistent connection), regardless of where the master is, and they drop any job they're running. So far it hasn't affected many builds because the number running is still small after the tree opening.

<ravi> there is a case open with the vendor
<ravi> yeah. so the system is failing over to the standby so it shouldn't be a hard down for more than a very small period
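For anyone wanting to tally which hosts dropped in a burst like the one quoted above, here is a minimal sketch. The regular expression is an assumption inferred only from the alert lines in this description, and the names ALERT_RE, parse_alerts and down_hosts are made up for illustration; this is not how Nagios or RelEng tooling actually processes alerts.

```python
import re

# Alert text copied from the description of this bug.
ALERTS = """\
[13] mv-buildproxy01.build is DOWN: PING CRITICAL - Packet loss = 100%
[14] talos-staging-master02 is DOWN: PING CRITICAL - Packet loss = 100%
[15] test-master01.build is DOWN: PING CRITICAL - Packet loss = 100%
[16] talos-master02.build is DOWN: PING CRITICAL - Packet loss = 100%
[17] geriatric-master.build is DOWN: PING CRITICAL - Packet loss = 100%
[18] staging-mobile-master.build is DOWN: PING CRITICAL - Packet loss = 100%
[20] test-master02.build is DOWN: PING CRITICAL - Packet loss = 100%
[21] production-mobile-master.build is DOWN: PING CRITICAL - Packet loss = 100%
"""

# Assumed line shape: "[id] host is DOWN: PING CRITICAL - Packet loss = N%".
ALERT_RE = re.compile(
    r"\[(\d+)\]\s+(\S+) is DOWN: PING CRITICAL - Packet loss = (\d+)%"
)


def parse_alerts(text):
    """Return a list of (alert_id, host, loss_percent) tuples."""
    return [
        (int(m.group(1)), m.group(2), int(m.group(3)))
        for m in ALERT_RE.finditer(text)
    ]


def down_hosts(text):
    """Return the sorted set of hosts reported with 100% packet loss."""
    return sorted({host for _, host, loss in parse_alerts(text) if loss == 100})


if __name__ == "__main__":
    for host in down_hosts(ALERTS):
        print(host)
```

Running the same function over each burst (~1617, ~1700, ~17:14) would show whether the same set of hosts drops every time, which is what a shared-path failure such as a firewall failover would look like.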

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 1

•

15 years ago

This is a change since bug 591468, the network changes in the Castro office. Bugzilla won't let me set that bug as blocking this one because 591468 is not visible to me.

Nick Thomas [:nthomas] (UTC+12)

Reporter

Updated

•

15 years ago

Summary: Network in Castro failing periodically → Network in Castro failing intermittently

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 2

•

15 years ago

And at ~ 19:49.

matthew zeier [:mrz]

Comment 3

•

15 years ago

I opened that bug up. Issues being worked on. Is this affecting the tree?

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 4

•

15 years ago

(Discussed on IRC.) The short answer is yes, because network interruptions end up dropping compiles/tests/perf tests on the floor. They can be rerun by RelEng by manual intervention, but that won't scale to a full weekday of load. Also from IRC, IT have disabled one half of the High Availability setup. There seems to be 'some condition that is causing the standby unit to think the primary failed and it takes over, but nothing failed'.

Shyam Mani [:fox2mike]

Comment 5

•

15 years ago

Ravi is running point here.

Assignee: server-ops → ravi

Component: Server Operations → Server Operations: Netops

Nick Thomas [:nthomas] (UTC+12)

Reporter

Comment 6

•

15 years ago

Bounced again at ~0245. Different failure mode?

bhearsum@mozilla.com (:bhearsum)

Comment 7

•

15 years ago

Again at 6:00am, October 18.

bhearsum@mozilla.com (:bhearsum)

Comment 8

•

15 years ago

Again at 7:52am, Oct 18

Chris AtLee [:catlee]

Comment 9

•

15 years ago

Again at 08:11, Oct 18. Upgrading to blocker, this is causing lots of stuff to fail.

Severity: critical → blocker

Chris AtLee [:catlee]

Updated

•

15 years ago

Assignee: ravi → network-operations

Shawn Wilsher :sdwilsh

Comment 10

•

15 years ago

It's also killing Internet access for people at the office. Makes getting work done difficult.

Shyam Mani [:fox2mike]

Comment 12

•

15 years ago

Folks are working on it, no need to page oncall.

Assignee: network-operations → ravi

Mike Beltzner [:beltzner, not reading bugmail]

Comment 13

•

15 years ago

It would be useful if someone from the IT team could email mv-all@ to let people know what the symptoms are, and what the ETA is for correction. That way decisions can be made about whether to commute into the office, etc.

Derek Moore [:dmoore]

Comment 14

•

15 years ago

Ravi is going to arrive onsite before I do; he should be able to update shortly.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 15

•

15 years ago

(In reply to comment #13)
> It would be useful if someone from the IT team could email mv-all@ to let
> people know what the symptoms are, and what the ETA is for correction. That way
> decisions can be made about whether to commute into the office, etc.

I just talked with Ravi in the car; he estimates that things should be working again in ~30 mins. More info as he arrives, but yes, commuting to the office is still worthwhile.

Mike Beltzner [:beltzner, not reading bugmail]

Comment 16

•

15 years ago

Great. Who's writing the email to mv-all@? Or are you counting on everyone to be cc'd to this bug?

Mike Beltzner [:beltzner, not reading bugmail]

Updated

•

15 years ago

Whiteboard: ETA: to be resolved by 9:30am PT, Monday October 18

Derek Moore [:dmoore]

Comment 17

•

15 years ago

mrz has just applied the suggested fix from the vendor. If the problem persists, we will revert the firewall upgrade which was performed over the weekend. Either way, we do not expect this problem to continue into the work day.

Ravi Pina [:ravi]

Assignee

Comment 18

•

15 years ago

Lowering priority as things have been stable for 7 hours.

Severity: blocker → major

Status: NEW → ASSIGNED

John Ford [:jhford] CET/CEST Berlin Time

Comment 19

•

15 years ago

Don't know if it's related, but I just tried to send an email through smtp.mozilla.org and it failed 3 times. Without changing any settings in my application and with no delay, the 4th click on the send button did properly send the email.

Mike Beltzner [:beltzner, not reading bugmail]

Comment 20

•

15 years ago

etherpad.mozilla.com craps out from time to time, as do other mozilla sites; not sure if this is fully fixed.

Ravi Pina [:ravi]

Assignee

Comment 21

•

15 years ago

http://etherpad.mozilla.com/ gives me 'invalid superdomain'. I'm not clear what, if anything different, it should look like.

matthew zeier [:mrz]

Comment 22

•

15 years ago

Toronto doesn't rely on Mountain View to hit the San Jose datacenter. http://etherpad.mozilla.com:9000/ works for me from Mountain View (no VPNs up). When it times out for you, could you paste in the output of 'traceroute etherpad.mozilla.com'?

Ravi Pina [:ravi]

Assignee

Comment 23

•

15 years ago

(In reply to comment #19) Is your sample size only those 4 connections today?

Shawn Wilsher :sdwilsh

Comment 24

•

15 years ago

(In reply to comment #23)
> Is your sample size only those 4 connections today?

I've had mxr.mozilla.org return a "server not found" several times throughout the day today. Hitting refresh would make it work again. Of course, I'm not sure if that's the network, or Minefield...

Shawn Wilsher :sdwilsh

Comment 25

•

15 years ago

And just trying to post that comment via the Bugzilla REST API has failed again on the first attempt. This is the second time I've hit that today.

Ravi Pina [:ravi]

Assignee

Comment 26

•

15 years ago

Bugzilla, MXR, mail, etc. are all external hosts that are not in MTV and not related to the issues in MTV. It is possible that there were some blips throughout the day (we had a power outage this afternoon). Unless the host you are trying to reach is in 10.250.0.0/16, or you are in MTV1 trying to reach resources, it is very unlikely this is the appropriate bug for those issues; they are more likely an unrelated issue.
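As a quick way to apply the rule of thumb above, the following sketch checks whether an address falls inside the 10.250.0.0/16 range named in this comment. It uses only Python's standard ipaddress module; the function name in_affected_range and the sample addresses are hypothetical and are not taken from the actual network.

```python
import ipaddress

# The range named in the comment above.
AFFECTED_NET = ipaddress.ip_network("10.250.0.0/16")


def in_affected_range(addr):
    """Return True if addr (a dotted-quad string) is inside 10.250.0.0/16."""
    return ipaddress.ip_address(addr) in AFFECTED_NET


if __name__ == "__main__":
    # Hypothetical sample addresses, for illustration only.
    for sample in ("10.250.3.7", "10.2.0.1", "192.0.2.10"):
        print(sample, in_affected_range(sample))
```

Resolving the hostname first and passing the resulting address to the check gives a rough first filter for whether a failure belongs in this bug or in a new one.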

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 27

•

15 years ago

Everyone from mv.mozilla.com in #developers timed out at 7:39 and change EST.

matthew zeier [:mrz]

Comment 28

•

15 years ago

I do believe this was all resolved by 5:30pm when I was heading out. At some point we flipped over to the backup Internet provider. Undoing that contributed to some of the issues you saw late in the day. Calling fixed, re-open if you see problems related to this.

Status: ASSIGNED → RESOLVED

Closed: 15 years ago

Resolution: --- → FIXED

Shawn Wilsher :sdwilsh

Comment 29

•

15 years ago

(In reply to comment #25)
> And just trying to post that comment via the Bugzilla REST API has failed again
> on the first attempt. This is the second time I've hit that today.

Still getting this. It's the same behavior that I'm seeing in the browser too. Our web properties keep returning a null response (as viewed from an XHR) randomly. The browser will display a server not found page when this happens.

matthew zeier [:mrz]

Comment 30

•

15 years ago

You only see this to Mozilla hosts?

Shawn Wilsher :sdwilsh

Comment 31

•

15 years ago

(In reply to comment #30)
> You only see this to Mozilla hosts?

It's the only ones I've hit this on, but like I said it is random. My sample size is largely mozilla hosts too though.

Shawn Wilsher :sdwilsh

Comment 32

•

15 years ago

(In reply to comment #31)
> It's the only ones I've hit this on, but like I said it is random. My sample
> size is largely mozilla hosts too though.

Just hit it on forecast.weather.gov (not one of our web properties).

Shawn Wilsher :sdwilsh

Comment 33

•

15 years ago

And I can now confirm that this isn't just me and my machine. The TV that displays the tinderbox pushlog hit this while booting up this morning.

Ravi Pina [:ravi]

Assignee

Comment 34

•

15 years ago

What does "hit" mean in this context?

Whiteboard: ETA: to be resolved by 9:30am PT, Monday October 18

Shawn Wilsher :sdwilsh

Comment 35

•

15 years ago

s/hit/experienced/

Mike Beltzner [:beltzner, not reading bugmail]

Comment 36

•

15 years ago

Damon's getting connections reset all the time. I'm getting timeouts at Bugzilla. Should these be new bugs? Coincidences? Frustrating!

matthew zeier [:mrz]

Comment 37

•

15 years ago

New bugs - you're both in different locations and this bug is about Castro.

matthew zeier [:mrz]

Comment 38

•

15 years ago

(but Ravi's looking at it)

Ravi Pina [:ravi]

Assignee

Comment 39

•

15 years ago

Connections reset to where? Define reset. Bugzilla is only an external IP so you're not traversing the tunnel, but rather just the open internet. What aspect is timing out? Page loads? Are you able to get to it at all, some, or none of the time? I'll look at the router over there to see if there are global issues for the office. Please put the details in new bugs.

Mike Beltzner [:beltzner, not reading bugmail]

Updated

•

15 years ago

Depends on: 607119

Mike Beltzner [:beltzner, not reading bugmail]

Updated

•

15 years ago

No longer depends on: 607119

Nobody; OK to take it and work on it

Updated

•

12 years ago

Product: mozilla.org → Infrastructure & Operations

BMO Automation

Updated

•

3 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard