Several reports of slow response for mozilla.org

RESOLVED FIXED

Status

P4
normal
6 years ago
5 years ago

People

(Reporter: rik, Assigned: nmaul)

Details

(Whiteboard: [triaged 20121012])

Attachments

(2 attachments)

(Reporter)

Description

6 years ago
In the last couple of days, I've seen a bunch of people complaining about the loading time of www.mozilla.org.
Some of them even saw an error message, something like "Service unavailable" or "Service not responding" (I can't remember exactly, sorry).

I've CCed two of them: Pascal is in France, Tomer is in Israel.
(Assignee)

Comment 1

6 years ago
We've had a number of odd networking issues lately which likely contributed to this, so I suspect it may be very hard to track down. Some of these *definitely* contributed heavily (Catchpoint monitoring was highly annoyed at www.mozilla.org's performance). I'm inclined not to investigate much further until we can say with some certainty that those networking issues were not the underlying cause here.

One thing that will definitely help is anything we can do regarding bug 742890. Now is probably not the time to work on this extensively (given the stub installer stuff that's ongoing at the moment), but if there's anything we can do easily (cleanly! don't want to rush bad code to production), it might be worth a quick glance.
Priority: -- → P4
Whiteboard: [triaged 20121012]

Comment 2

6 years ago
I have some connection problems from my location: sometimes pages load very slowly and sometimes not at all. I think this issue appeared a few days ago. Is it possible that it is caused by heavy load when everyone gets their browser upgrades?

When I monitored the main page, it took more than a minute to load, and I found a few HTTP 500 errors for some of the images and scripts attached to the main page.

Monitoring the page https://www.mozilla.org/en-US/ using cURL showed that sometimes the page loads in around 15 seconds, while at other times it fails with an error after a very long time. About 1/10 of my tests failed with the error message shown below.


$ time curl -vv https://www.mozilla.org/en-US/

* About to connect() to www.mozilla.org port 443 (#0)
*   Trying 63.245.213.92... connected
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Server finished (14):
* SSLv3, TLS handshake, Client key exchange (16):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSL connection using RC4-SHA
* Server certificate:
* 	 subject: businessCategory=Private Organization; 1.3.6.1.4.1.311.60.2.1.3=US; 1.3.6.1.4.1.311.60.2.1.2=California; serialNumber=C2543436; C=US; ST=California; L=Mountain View; O=Mozilla Foundation; OU=IT Operations; CN=www.mozilla.org
* 	 start date: 2011-12-15 20:35:47 GMT
* 	 expire date: 2013-12-16 21:23:08 GMT
* 	 subjectAltName: www.mozilla.org matched
* 	 issuer: C=US; O=GeoTrust Inc; OU=See www.geotrust.com/resources/cps (c)06; CN=GeoTrust Extended Validation SSL CA
* 	 SSL certificate verify ok.
> GET /en-US/ HTTP/1.1
> User-Agent: curl/7.22.0 (i686-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
> Host: www.mozilla.org
> Accept: */*
> 
< HTTP/1.1 500 Internal Server Error
< Date: Fri, 12 Oct 2012 23:21:15 GMT
< Connection: close
< Content-Type: text/html
< 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<title>Service Unavailable</title>
<style type="text/css">
body, p, h1 {
  font-family: Verdana, Arial, Helvetica, sans-serif;
}
h2 {
  font-family: Arial, Helvetica, sans-serif;
  color: #b10b29;
}
</style>
</head>
<body>
<h2>Service Unavailable</h2>
<p>The service is temporarily unavailable. Please try again later.</p>
</body>
</html>
* Closing connection #0
* SSLv3, TLS alert, Client hello (1):

real	1m13.357s
user	0m0.020s
sys	0m0.012s
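
For anyone who wants to repeat this kind of measurement without timing curl by hand, here is a minimal sketch in Python. It is not what was used for the numbers above. The function names `probe` and `summarize`, the 120-second timeout, and the choice to count any 5xx status or connection error as a failure are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=120.0):
    """Fetch url once and return (status_code_or_None, elapsed_seconds).

    The 120 s default timeout is an assumption, chosen to be longer than
    the ~73 s failure seen in the curl run above.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # server answered with an error status, e.g. 500
    except OSError:
        status = None  # connection error or timeout (URLError is an OSError)
    return status, time.monotonic() - start


def summarize(samples):
    """Summarize a list of (status, seconds) pairs.

    A sample counts as a failure if there was no response at all or the
    status was 5xx (an assumption about what "failed" should mean here).
    """
    failures = [s for s in samples if s[0] is None or s[0] >= 500]
    return {
        "requests": len(samples),
        "failures": len(failures),
        "failure_rate": len(failures) / len(samples) if samples else 0.0,
        "slowest_seconds": max((t for _, t in samples), default=0.0),
    }


# Example (performs real network requests, so it is left commented out):
# results = [probe("https://www.mozilla.org/en-US/") for _ in range(10)]
# print(summarize(results))
```

Running the commented-out example a few times from different locations would give a rough failure rate comparable to the "about 1/10" figure above, though ten requests is a small sample.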
(Reporter)

Comment 3

6 years ago
Jake: I'd argue that this is more important than the stub installer. With those issues, a lot of people can't download Firefox. That's more important than a test.
(Assignee)

Comment 4

6 years ago
Created attachment 670986 [details]
graph showing recent response times

I understand and don't disagree that it's important. However, please see the attached graph. That's why I'm hesitant to spend time investigating this more thoroughly right now. The statistical data strongly indicates that this is vastly improved as of approximately 3pm PT today.

The 4 red dots around 30 seconds or so are Munich, Rome, London, and Amsterdam. This may indicate a problem in the EU... or just be bad luck (4 data points don't make much of a trend). I can look at the AMS1 load balancer cluster specifically, or even disable it and send that traffic back to PHX1/SCL3 like everything else... but there's a decent chance that would actually be *worse* overall, rather than better.
(Assignee)

Comment 5

6 years ago
Looking further back in Catchpoint, I see that those red ~30s timeouts started around October 8, and are largely concentrated in Europe.

I believe this to be somewhat related to problems in SCL3... according to the Zeus cluster in AMS1, I am getting some flapping when it tries to reach the SCL3 origin.

For the time being, I have disabled the SCL3 origin in the AMS1 Zeus cluster. In a few hours it should become obvious if this has a noticeable impact.
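
To make "flapping" concrete: it means the health check for an origin keeps alternating between up and down instead of settling. The sketch below counts those state changes in a list of check results. It is only an illustration and not how Zeus decides anything; the function names and the threshold of 3 transitions are hypothetical.

```python
def count_transitions(history):
    """Count up/down state changes in an ordered list of health-check results.

    history is a list of booleans, True meaning the check passed.
    """
    return sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)


def is_flapping(history, max_transitions=3):
    """Treat an origin as flapping if it changed state more than
    max_transitions times in the observed window (hypothetical threshold)."""
    return count_transitions(history) > max_transitions
```

An origin that is steadily down has zero transitions, so this check separates "unstable" from "broken", which matters when deciding whether to pull a node from a pool.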
(Assignee)

Comment 6

6 years ago
Created attachment 671517 [details]
www.mozilla.org response times, 10/15

This seems to have helped drastically. I can't be 100% sure that there weren't network improvements around the same time, so we might want to flip back and forth once or twice to be sure, but it definitely seems much better now.
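
If we do flip back and forth, one rough way to compare the periods is to look at the median response time and the share of very slow requests in each. The sketch below assumes the response times have been exported as plain lists of seconds; the function name `period_stats` and the 25-second cutoff for "slow" are assumptions, picked only because the bad data points were around 30 seconds.

```python
from statistics import median


def period_stats(samples, slow_threshold=25.0):
    """Return the median and the fraction of slow requests for one period.

    samples is a non-empty list of response times in seconds;
    slow_threshold is a hypothetical cutoff for a "slow" request.
    """
    slow = sum(1 for t in samples if t >= slow_threshold)
    return {
        "median_seconds": median(samples),
        "slow_fraction": slow / len(samples),
    }
```

The slow fraction is worth tracking alongside the median because a handful of ~30s timeouts can leave the median almost unchanged. Comparing `period_stats` for an "SCL3 enabled" window against an "SCL3 disabled" window, repeated across a couple of flips, would help separate the effect of the change from unrelated network improvements.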
(Reporter)

Comment 7

6 years ago
Thanks for taking a look at this late on a Friday!
(Assignee)

Updated

6 years ago
Duplicate of this bug: 800613
(Assignee)

Comment 9

6 years ago
I've spoken to our netops folks, and there was indeed an issue in SCL3 from 10/8 to 10/12, which is now fixed. I've re-enabled the SCL3 node in the AMS1 backend pool, and all seems well after several hours.

Calling this one fixed. Thanks!
Assignee: server-ops-webops → nmaul
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations