Closed Bug 712237 Opened 13 years ago Closed 13 years ago

developer.mozilla.org is not responding

Categories

(Infrastructure & Operations Graveyard :: NetOps: DC ACL Request, task)

All
Other
task
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: MatsPalmgren_bugz, Assigned: ahill)

Details

1. load https://developer.mozilla.org/en/CSS/cursor

Actual Result:

The connection was reset

The connection to the server was reset while the page was loading.

The site could be temporarily unavailable or too busy. Try again in a few moments.
If you are unable to load any pages, check your computer's network connection.
If your computer or network is protected by a firewall or proxy, make sure that Nightly is permitted to access the Web.
CC'ing in jakem. Looking at deki-api.log, I see a lot of bugzilla lookup timeouts. I suppose this is a macro or somesuch that runs on the server-side for references to the bugs in the wiki pages.

2011-12-20 08:37:50,587 [DispatchThread #9692] ERROR MindTouch.Dream.Http.HttpPlugEndpoint - HandleInvoke@BeginGetResponse(GET, https://bugzilla.mozilla.org/show_bug.cgi?id=629801) System.TimeoutException: async operation timed out

Which I verified to be correct via curl on one of the webheads:

[root@pm-dekiwiki03 ~]# curl -v https://bugzilla.mozilla.org/show_bug.cgi?id=269482 -o /dev/null
* About to connect() to bugzilla.mozilla.org port 443
*   Trying 63.245.217.60... Connection timed out
* couldn't connect to host
* Closing connection #0

curl: (7) couldn't connect to host

FYI bmo traffic was switched over from SJC1 to PHX at around 0030 after a 20 min outage. I don't yet know why the deki webheads can't contact bmo.
This seems to be some sort of ACL issue. Netops, any ideas?

Since Bugzilla flipped back to PHX1, developer.mozilla.org (in SJC1) cannot reach it.

The servers are:

pm-dekiwiki01, 02, 03:
10.2.81.30
10.2.81.49
10.2.81.50
Assignee: server-ops → network-operations
Component: Server Operations: Web Operations → Server Operations: ACL Request
QA Contact: cshields → network-operations-acl
Please provide the source(s) so we can verify the flows.
The 3 IPs in comment 2 are the source... the destination IP is in comment 1 (63.245.217.60).
Assignee: network-operations → ravi
Would be good to get this fixed asap; this is increasingly a major problem. It's not only generating a lot of complaints but is blocking the writing team from getting anything done. :)
Severity: major → critical
Should be better now.

[root@pm-dekiwiki01 ~]# telnet bugzilla.mozilla.org 443
Trying 63.245.209.72...
Connected to bugzilla.mozilla.org (63.245.209.72).
Escape character is '^]'.
Connection closed by foreign host.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Those both work for me now, as does the original link (https://developer.mozilla.org/en/CSS/cursor). I believe what we have here is more of a general performance issue, and that the MDN->Bugzilla connection is functional again.

The performance issues, whether in Zeus or MDN, are tracked in separate bugs. pm-dekiwiki02 and 03 seem to be slower than 01 for some things, and the PHX1 Zeus cluster (where Bugzilla lives) is undergoing some maintenance right now. Either could be contributing to this.

Those bugs are:

Bug 713685 (PHX1 Zeus)
Bug 713363 (pm-dekiwiki03 flapping)
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
This seems to be broken again.

[root@pm-dekiwiki01 ~]# telnet bugzilla.mozilla.org 443
Trying 63.245.217.60...
telnet: connect to address 63.245.217.60: Connection timed out
telnet: Unable to connect to remote host: Connection timed out

[root@pm-dekiwiki01 ~]# curl -I -L https://bugzilla.mozilla.org/
curl: (7) couldn't connect to host
Assignee: ravi → ahill
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: 713366
Blocks: 712928
All the proper permits are still in place on the network.  Is there anything on the lb that might have changed that could be blocking this now?
Nothing that I am aware of. Bugzilla should be wide open, and we don't do any relevant outbound filtering from servers that would affect this.

It seems to be working again now, although in some cases it's very slow. Sometimes connections are instant, and other times they seem to take much, much longer... I'm trying to get some stats on this now.

Can you confirm that the relevant rules apply to all 3 pm-dekiwiki* servers? I'm seeing some inconsistent behavior between them, just want to rule out the network. The 3 IPs are in comment 2.
For the record, I'm testing like this:

while true; do curl -m 15 https://bugzilla.mozilla.org/show_bug.cgi?id=452232 -o /dev/null; sleep 60; done 2>&1 | grep -v '% Total'

Success happens in less than 5 seconds, always (usually 1-2 seconds). Failure takes 15 seconds (timeout).

I'm still gathering data, but so far it seems like pm-dekiwiki03 fails more often than the other 2. None of them succeed or fail all the time, though... it's intermittent. The overall failure rate seems to be around 15-25%, of which probably 75% is concentrated on pm-dekiwiki03.
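
For anyone wanting to reproduce the per-host failure rates, here is a rough sketch of how the probe results could be tallied. The function names (probe_once, summarize_probes) and the "<host> <exit code>" line format are made up for illustration; they are not what was actually run. It assumes bash, awk, and curl are available, and it treats any non-zero curl exit code (for example 7 for "couldn't connect" or 28 for a timeout) as a failure.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run one probe with the same 15-second timeout as the
# loop above and print "<short hostname> <curl exit code>".
probe_once() {
  local host url=$1
  host=$(hostname -s)
  curl -s -m 15 -o /dev/null "$url"
  # $? here is still curl's exit status: 0 on success, non-zero on failure.
  echo "$host $?"
}

# Hypothetical helper: read "<host> <exit code>" lines on stdin and print
# "<host> <failures>/<total> <percent>%" for each host, sorted by host name.
summarize_probes() {
  awk '
    { total[$1]++; if ($2 != 0) fail[$1]++ }
    END {
      for (h in total)
        printf "%s %d/%d %.0f%%\n", h, fail[h], total[h], 100 * fail[h] / total[h]
    }
  ' | sort
}
```

One way to use it would be to append the output of probe_once to a log file on each webhead once a minute, then concatenate the logs from all three servers and pipe them through summarize_probes.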
oremj just made a load balancer change in PHX1 that I think has helped this a lot. Specifically, tcp_tw_recycle is now disabled. This setting apparently can cause problems resulting in dropped packets, especially when NAT is involved (it is in this case... pm-dekiwiki* have to go through a NAT to get to Bugzilla). Since the change was made, I haven't had a single test fail.

I think this setting was originally put into place a few days ago... possibly even before you initially set up the ACLs. The symptom is dropped packets, but as opposed to an ACL issue, it's intermittent and not consistent. When I re-opened this it seemed to be consistent, but I suspect if I had tested longer back then it would have proved to be intermittent.
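
A minimal sketch of how the setting could be checked on a Linux host, for reference. This assumes the load balancer exposes the standard Linux procfs path /proc/sys/net/ipv4/tcp_tw_recycle, which I have not confirmed for the Zeus boxes; the function name tw_recycle_status is made up for illustration. Note that newer kernels (4.12 and later) removed this setting entirely, in which case the file does not exist.

```shell
#!/usr/bin/env bash

# Hypothetical helper: report whether tcp_tw_recycle is enabled.
# The optional first argument overrides the procfs path, which is an
# assumption about the host and is also handy for testing.
tw_recycle_status() {
  local path=${1:-/proc/sys/net/ipv4/tcp_tw_recycle}
  if [ ! -r "$path" ]; then
    # The setting is missing (for example, kernel 4.12 or later) or unreadable.
    echo "unavailable"
    return 0
  fi
  case "$(cat "$path")" in
    0) echo "disabled" ;;
    *) echo "enabled" ;;
  esac
}
```

Running tw_recycle_status with no arguments prints "disabled", "enabled", or "unavailable" for the local host.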

I'm going to close this back out. Thanks for the (extended) help!
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard