
Network in Castro failing intermittently

RESOLVED FIXED

Status

Product: Infrastructure & Operations
Component: NetOps
Priority: --
Severity: major
Status: RESOLVED FIXED
Reported: 7 years ago
Modified: 4 years ago

People

(Reporter: nthomas, Assigned: ravi)

Tracking

Details

(Reporter)

Description

7 years ago
Eg ~ 17:14 this evening

[13] mv-buildproxy01.build is DOWN: PING CRITICAL - Packet loss = 100%
[14] talos-staging-master02 is DOWN: PING CRITICAL - Packet loss = 100%
[15] test-master01.build is DOWN: PING CRITICAL - Packet loss = 100%
[16] talos-master02.build is DOWN: PING CRITICAL - Packet loss = 100%
[17] geriatric-master.build is DOWN: PING CRITICAL - Packet loss = 100%
[18] staging-mobile-master.build is DOWN: PING CRITICAL - Packet loss = 100%
[20] test-master02.build is DOWN: PING CRITICAL - Packet loss = 100%
[21] production-mobile-master.build is DOWN: PING CRITICAL - Packet loss = 100%

followed by recovery messages. Also at ~16:17 and ~17:00. From a RelEng point of view this is causing slaves in Castro to lose connection to their buildbot master (they maintain a persistent connection), regardless of where the master is, and they drop any job they're running. So far it hasn't been many builds because the number running is still small after the tree opening.

<ravi> there is a case open with the vendor
<ravi> yeah. so the system is failing over to the standby so it shouldn't be a hard down for more than a very small period
(Reporter)

Comment 1

7 years ago
This is a change since bug 591468, the network changes in the Castro office. Bugzilla won't let me set that bug as blocking this one because 591468 is not visible to me.
(Reporter)

Updated

7 years ago
Summary: Network in Castro failing periodically → Network in Castro failing intermittently
(Reporter)

Comment 2

7 years ago
And at ~ 19:49.

Comment 3

7 years ago
I opened that bug up.  Issues being worked on.  Is this affecting the tree?
(Reporter)

Comment 4

7 years ago
(discussed on IRC). The short answer is yes, because network interruptions end up dropping compiles/tests/perf tests on the floor. They can be rerun by RelEng by manual intervention, but that won't scale to a full weekday of load.

Also from IRC, IT have disabled one half of the High Availability setup. There seems to be 'some condition that is causing the standby unit to think the primary failed and it takes over, but nothing failed'.
Ravi is running point here.
Assignee: server-ops → ravi
Component: Server Operations → Server Operations: Netops
(Reporter)

Comment 6

7 years ago
Bounced again at ~02:45. Different failure mode?
Again at 6:00am, October 18.
Again at 7:52am, Oct 18
Again at 08:11, Oct 18.

Upgrading to blocker, this is causing lots of stuff to fail.
Severity: critical → blocker

Updated

7 years ago
Assignee: ravi → network-operations
It's also killing Internet access for people at the office.  Makes getting work done difficult.
Duplicate of this bug: 605142
Folks are working on it, no need to page oncall.
Assignee: network-operations → ravi
It would be useful if someone from the IT team could email mv-all@ to let people know what the symptoms are, and what the ETA is for correction. That way decisions can be made about whether to commute into the office, etc.
Ravi is going to arrive onsite before I do, he should be able to update shortly.
(In reply to comment #13)
> It would be useful if someone from the IT team could email mv-all@ to let
> people know what the symptoms are, and what the ETA is for correction. That way
> decisions can be made about whether to commute into the office, etc.

I just talked with Ravi in the car; he estimates that things should be back
working again ~30mins. More info as he arrives, but yes, commuting to office
still worthwhile.
Great. Who's writing the email to mv-all@? Or are you counting on everyone to be cc'd to this bug?
Whiteboard: ETA: to be resolved by 9:30am PT, Monday October 18
mrz has just applied the suggested fix from the vendor. If the problem persists, we will revert the firewall upgrade which was performed over the weekend.

Either way, we do not expect this problem to continue into the work day.
(Assignee)

Comment 18

7 years ago
Lowering priority as things have been stable for 7 hours.
Severity: blocker → major
Status: NEW → ASSIGNED
Don't know if it's related, but I just tried to send an email through smtp.mozilla.org and it failed 3 times. Without changing any settings in my application and with no delay, the 4th click on the send button did properly send the email.
etherpad.mozilla.com craps out from time to time, as do other mozilla sites; not sure if this is fully fixed.
(Assignee)

Comment 21

7 years ago
http://etherpad.mozilla.com/ gives me 'invalid superdomain'. I'm not clear what, if anything different, it should look like.
Toronto doesn't rely on Mountain View to hit the San Jose datacenter.

http://etherpad.mozilla.com:9000/ works for me from Mountain View (no VPNs up). When it times out for you, could you paste in the output of 'traceroute etherpad.mozilla.com'?
(Assignee)

Comment 23

7 years ago
(In reply to comment #19)

Is your sample size only those 4 connections today?
(In reply to comment #23)
> Is your sample size only those 4 connections today?
I've had mxr.mozilla.org return a "server not found" several times throughout the day today. Hitting refresh would make it work again.

Of course, I'm not sure if that's the network, or minefield...
And just trying to post that comment via the Bugzilla REST API has failed again on the first attempt.  This is the second time I've hit that today.
(Assignee)

Comment 26

7 years ago
bugzilla, mxr, mail, etc. are all external hosts, not in MTV or related to the issues in MTV. It is possible that throughout the day there may have been some blips (we had a power outage this afternoon). Unless the host you are trying to reach is in 10.250.0.0/16, or you are in MTV1 trying to reach resources, it is very unlikely this is the appropriate bug for those issues; more likely it is an unrelated issue.
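For anyone unsure whether a host falls inside the range mentioned above, here is a minimal sketch in Python using the standard `ipaddress` module. It assumes only that the affected range is 10.250.0.0/16 as stated; the function name and the sample addresses are hypothetical and chosen purely for illustration.

```python
import ipaddress

# The range named in the comment above; everything else here is illustrative.
AFFECTED_NET = ipaddress.ip_network("10.250.0.0/16")


def in_affected_range(address: str) -> bool:
    """Return True if the given IPv4 address lies inside 10.250.0.0/16."""
    return ipaddress.ip_address(address) in AFFECTED_NET


# Hypothetical sample addresses, not real hosts from this bug.
print(in_affected_range("10.250.3.17"))  # inside the /16
print(in_affected_range("192.0.2.10"))   # outside the /16
```

A hostname would need to be resolved to an address first (for example with `socket.gethostbyname`) before passing it to this check.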
Everyone from mv.mozilla.com in #developers timed out at 7:39 and change EST.
I do believe this was all resolved by 5:30pm when I was heading out.  At some point we flipped over to the backup Internet provider.  

Undoing that contributed to some of the issues you saw late in the day.

Calling fixed, re-open if you see problems related to this.
Status: ASSIGNED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
(In reply to comment #25)
> And just trying to post that comment via the Bugzilla REST API has failed again
> on the first attempt.  This is the second time I've hit that today.
Still getting this.  It's the same behavior that I'm seeing in the browser too.  Our web properties keep returning a null response (as viewed from an XHR) randomly.  The browser will display a server not found page when this happens.
You only see this to Mozilla hosts?
(In reply to comment #30)
> You only see this to Mozilla hosts?
They're the only ones I've hit this on, but like I said it is random. My sample size is largely Mozilla hosts too, though.
(In reply to comment #31)
> It's the only ones I've hit this on, but like I said it is random.  My sample
> size is largely mozilla hosts too though.
Just hit it on forecast.weather.gov (not one of our web properties).
And I can now confirm that this isn't just me and my machine. The TV that displays the tinderbox pushlog hit this this morning while booting up.
(Assignee)

Comment 34

7 years ago
What does hit mean in this context?
Whiteboard: ETA: to be resolved by 9:30am PT, Monday October 18
s/hit/experienced/
Damon's getting connections reset all the time.

I'm getting timeouts at Bugzilla.

Should these be new bugs? Coincidences? Frustrating!
New bugs - you're both in different locations and this bug is about Castro.
(but Ravi's looking at it)
(Assignee)

Comment 39

7 years ago
Connections reset to where?  Define reset.

Bugzilla is only an external IP so you're not traversing the tunnel, but rather just the open internet.  What aspect is timing out?  Page loads?  Are you able to get to it at all, some, or none of the time?

I'll look at the router over there to see if there are global issues for the office.

Please put the details in new bugs.
Depends on: 607119
No longer depends on: 607119
Product: mozilla.org → Infrastructure & Operations