Closed
Bug 605064
Opened 15 years ago
Closed 15 years ago
Network in Castro failing intermittently
Categories
(Infrastructure & Operations Graveyard :: NetOps, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nthomas, Assigned: ravi)
References
Details
E.g. at ~17:14 this evening:
[13] mv-buildproxy01.build is DOWN: PING CRITICAL - Packet loss = 100%
[14] talos-staging-master02 is DOWN: PING CRITICAL - Packet loss = 100%
[15] test-master01.build is DOWN: PING CRITICAL - Packet loss = 100%
[16] talos-master02.build is DOWN: PING CRITICAL - Packet loss = 100%
[17] geriatric-master.build is DOWN: PING CRITICAL - Packet loss = 100%
[18] staging-mobile-master.build is DOWN: PING CRITICAL - Packet loss = 100%
[20] test-master02.build is DOWN: PING CRITICAL - Packet loss = 100%
[21] production-mobile-master.build is DOWN: PING CRITICAL - Packet loss = 100%
followed by recovery messages. The same happened at ~16:17 and ~17:00. From a RelEng point of view, this is causing slaves in Castro to lose the connection to their buildbot master (they maintain a persistent connection), regardless of where the master is, and they drop any job they're running. So far it hasn't affected many builds because the number running is still small after the tree opening.
<ravi> there is a case open with the vendor
<ravi> yeah. so the system is failing over to the standby so it shouldn't be a hard down for more than a very small period
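The alert lines quoted in the description follow a regular pattern, so they can be tallied per outage with a small script. The sketch below is only an illustration: the regular expression is inferred from the eight lines pasted above, and the function names `parse_alert` and `down_hosts` are hypothetical helpers, not part of any monitoring tool.

```python
import re

# Pattern inferred from the alert lines quoted in this bug, e.g.
# "[13] mv-buildproxy01.build is DOWN: PING CRITICAL - Packet loss = 100%"
ALERT_RE = re.compile(
    r"^\[(?P<seq>\d+)\]\s+(?P<host>\S+) is (?P<state>\w+): (?P<detail>.+)$"
)


def parse_alert(line):
    """Return a dict for one alert line, or None if it does not match."""
    match = ALERT_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["seq"] = int(fields["seq"])
    return fields


def down_hosts(lines):
    """Collect host names from every line whose state is DOWN."""
    parsed = (parse_alert(line) for line in lines)
    return [p["host"] for p in parsed if p and p["state"] == "DOWN"]


if __name__ == "__main__":
    sample = [
        "[13] mv-buildproxy01.build is DOWN: PING CRITICAL - Packet loss = 100%",
        "[14] talos-staging-master02 is DOWN: PING CRITICAL - Packet loss = 100%",
        "followed by recovery messages.",
    ]
    print(down_hosts(sample))
```

Lines that do not match the pattern are skipped rather than treated as errors, since a pasted log usually mixes alerts with free text.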
Comment 1•15 years ago (Reporter)
This has changed since bug 591468 (the network changes in the Castro office). Bugzilla won't let me set that bug as blocking this one because 591468 is not visible to me.
Updated•15 years ago (Reporter)
Summary: Network in Castro failing periodically → Network in Castro failing intermittently
Comment 2•15 years ago (Reporter)
And again at ~19:49.
Comment 3•15 years ago
I opened that bug up. Issues being worked on. Is this affecting the tree?
Comment 4•15 years ago (Reporter)
(Discussed on IRC.) The short answer is yes, because network interruptions end up dropping compiles/tests/perf tests on the floor. RelEng can rerun them manually, but that won't scale to a full weekday of load.
Also from IRC, IT have disabled one half of the High Availability setup. There seems to be 'some condition that is causing the standby unit to think the primary failed and it takes over, but nothing failed'.
Comment 5•15 years ago
Ravi is running point here.
Assignee: server-ops → ravi
Component: Server Operations → Server Operations: Netops
Comment 6•15 years ago (Reporter)
Bounced again at ~02:45. Different failure mode?
Comment 7•15 years ago
Again at 6:00am, October 18.
Comment 8•15 years ago
Again at 7:52am, Oct 18
Comment 9•15 years ago
Again at 08:11, Oct 18.
Upgrading to blocker; this is causing lots of stuff to fail.
Severity: critical → blocker
Updated•15 years ago
Assignee: ravi → network-operations
Comment 10•15 years ago
It's also killing Internet access for people at the office. Makes getting work done difficult.
Comment 12•15 years ago
Folks are working on it, no need to page oncall.
Assignee: network-operations → ravi
Comment 13•15 years ago
It would be useful if someone from the IT team could email mv-all@ to let people know what the symptoms are, and what the ETA is for correction. That way decisions can be made about whether to commute into the office, etc.
Comment 14•15 years ago
Ravi is going to arrive onsite before I do; he should be able to update shortly.
Comment 15•15 years ago
(In reply to comment #13)
> It would be useful if someone from the IT team could email mv-all@ to let
> people know what the symptoms are, and what the ETA is for correction. That way
> decisions can be made about whether to commute into the office, etc.
I just talked with Ravi in the car; he estimates that things should be
working again in ~30 minutes. More info when he arrives, but yes, commuting
to the office is still worthwhile.
Comment 16•15 years ago
Great. Who's writing the email to mv-all@? Or are you counting on everyone being cc'd on this bug?
Updated•15 years ago
Whiteboard: ETA: to be resolved by 9:30am PT, Monday October 18
Comment 17•15 years ago
mrz has just applied the suggested fix from the vendor. If the problem persists, we will revert the firewall upgrade which was performed over the weekend.
Either way, we do not expect this problem to continue into the work day.
Comment 18•15 years ago (Assignee)
Lowering priority as things have been stable for 7 hours.
Severity: blocker → major
Status: NEW → ASSIGNED
Comment 19•15 years ago
I don't know if it's related, but I just tried to send an email through smtp.mozilla.org and it failed 3 times. Without any change to my application's settings and with no delay, the 4th click on the Send button sent the email properly.
Comment 20•15 years ago
etherpad.mozilla.com craps out from time to time, as do other mozilla sites; not sure if this is fully fixed.
Comment 21•15 years ago (Assignee)
http://etherpad.mozilla.com/ gives me 'invalid superdomain'. I'm not clear what, if anything different, it should look like.
Comment 22•15 years ago
Toronto doesn't rely on Mountain View to hit the San Jose datacenter.
http://etherpad.mozilla.com:9000/ works for me from Mountain View (no VPNs up). When it times out for you, could you paste in the output of 'traceroute etherpad.mozilla.com'?
Comment 23•15 years ago (Assignee)
(In reply to comment #19)
Is your sample size only those 4 connections today?
Comment 24•15 years ago
(In reply to comment #23)
> Is your sample size only those 4 connections today?
I've had mxr.mozilla.org return a "server not found" error several times throughout the day today. Hitting refresh would make it work again.
Of course, I'm not sure if that's the network or Minefield...
Comment 25•15 years ago
And just trying to post that comment via the Bugzilla REST API has failed again on the first attempt. This is the second time I've hit that today.
Comment 26•15 years ago (Assignee)
Bugzilla, mxr, mail, etc. are all external hosts that are not in MTV and not related to the issues in MTV. It is possible that there were some blips during the day (we had a power outage this afternoon). Unless the host you are trying to reach is in 10.250.0.0/16, or you are in MTV1 trying to reach resources, it is very unlikely that this is the appropriate bug for those issues; they are more likely unrelated.
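As a rough illustration of the scoping rule in the comment above, the following sketch checks whether an address falls inside 10.250.0.0/16. It uses Python's standard `ipaddress` module; the function name `in_mtv_range` and the sample addresses are hypothetical and not taken from this bug.

```python
import ipaddress

# The only value taken from the bug is the 10.250.0.0/16 range itself.
MTV_NETWORK = ipaddress.ip_network("10.250.0.0/16")


def in_mtv_range(address):
    """Return True if the given IPv4 address is inside 10.250.0.0/16."""
    return ipaddress.ip_address(address) in MTV_NETWORK


if __name__ == "__main__":
    # Hypothetical sample addresses, chosen only to show both outcomes.
    for addr in ("10.250.48.7", "10.251.0.1", "63.245.208.1"):
        print(addr, in_mtv_range(addr))
```

A host that resolves to an address outside that range would, by the reasoning above, most likely belong in a separate bug.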
Everyone from mv.mozilla.com in #developers timed out at 7:39 and change EST.
Comment 28•15 years ago
I do believe this was all resolved by 5:30pm when I was heading out. At some point we flipped over to the backup Internet provider.
Undoing that contributed to some of the issues you saw late in the day.
Calling fixed, re-open if you see problems related to this.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Comment 29•15 years ago
(In reply to comment #25)
> And just trying to post that comment via the Bugzilla REST API has failed again
> on the first attempt. This is the second time I've hit that today.
Still getting this. It's the same behavior that I'm seeing in the browser too. Our web properties keep returning a null response (as viewed from an XHR) randomly. The browser will display a server not found page when this happens.
Comment 30•15 years ago
You only see this to Mozilla hosts?
Comment 31•15 years ago
(In reply to comment #30)
> You only see this to Mozilla hosts?
It's the only ones I've hit this on, but like I said it is random. My sample size is largely mozilla hosts too though.
Comment 32•15 years ago
(In reply to comment #31)
> It's the only ones I've hit this on, but like I said it is random. My sample
> size is largely mozilla hosts too though.
Just hit it on forecast.weather.gov (not one of our web properties).
Comment 33•15 years ago
And I can now confirm that this isn't just me and my machine. The TV that displays the tinderbox pushlog hit this while booting up this morning.
Comment 34•15 years ago (Assignee)
What does hit mean in this context?
Whiteboard: ETA: to be resolved by 9:30am PT, Monday October 18
Comment 35•15 years ago
s/hit/experienced/
Comment 36•15 years ago
Damon's getting connections reset all the time.
I'm getting timeouts at Bugzilla.
Should these be new bugs? Coincidences? Frustrating!
Comment 37•15 years ago
New bugs - you're both in different locations and this bug is about Castro.
Comment 38•15 years ago
(but Ravi's looking at it)
Comment 39•15 years ago (Assignee)
Connections reset to where? Define reset.
Bugzilla is only an external IP so you're not traversing the tunnel, but rather just the open internet. What aspect is timing out? Page loads? Are you able to get to it at all, some, or none of the time?
I'll look at the router over there to see if there are global issues for the office.
Please put the details in new bugs.
Updated•12 years ago
Product: mozilla.org → Infrastructure & Operations
Updated•3 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard