Service unavailable error for reps.mozilla.org/people

RESOLVED WONTFIX

Status

Infrastructure & Operations
WebOps: Other
RESOLVED WONTFIX
6 years ago
5 years ago

People

(Reporter: giorgos, Assigned: jd)

Tracking

Details

Attachments

(1 attachment)

(Reporter)

Description

6 years ago
Created attachment 634322 [details]
Service Unavailable Error

Hi there,

Lately we have been getting more and more "Service unavailable" errors on http://reps.mozilla.org/people

I don't get any Django error emails, so I guess this is related to other parts of the stack. Any ideas?

I've attached a screenshot of the error.

Thanks
(Assignee)

Comment 1

6 years ago
Giorgos,

This error is due to the web server taking too long to generate the page, which causes the load balancer to time out.

I have been thinking about this and have already set the Reps website up for some long-term trending analysis. Here are a few examples:

From Amsterdam:
http://p.catchpoint.com/ui/Entry/PW/V/CHN-KmF-v-YFMg-jPHENVLhgA-AKfAR3-C7Y0tWM-v

From LA:
http://p.catchpoint.com/ui/Entry/PW/V/CHN-KmF-O-YFMg-jPHENVLhgA-AKfAR3-C7Y0tWM-O

This will give you some idea of what is going on with the main page. This is the only page currently being trended; I can set up others, but it takes several days for the data to accumulate enough to be meaningful.


Now, as to the question about the people page. This page has some issues that cannot be fixed on the back end (by adding servers, etc.). I think a tiered approach will be needed to fix the problem.

First, I think some additional assets will need to be placed on a CDN; you might consider requesting an account on one of the CDNs that we (IT) manage. Not that there is anything wrong with the CDN you are using; we just have no way to tune it or recommend adjustments to it.

Second, you will need to find a way to start caching this page. I understand that you want real-time status on this page, but with 289 components (meaning 289 HTTP requests) there needs to be some relief for the back end. I did not look specifically, but there is generally a top-level page request that then blocks while doing subsequent requests, which can exacerbate the issue even further. Django has ways to set cache headers, which are honored by the load balancer (including a client-requested 'hard' refresh like [Ctrl] + F5).
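The two halves of that suggestion (a Cache-Control header the load balancer honors, plus a short server-side cache of the rendered page) can be sketched in plain Python. In a Django app you would normally reach for `django.views.decorators.cache.cache_control` and `cache_page` rather than hand-rolling this; the helpers below are a hypothetical stdlib-only illustration of the mechanism.

```python
import time
from functools import wraps

def cache_headers(max_age_seconds, private=False):
    """Build a Cache-Control header like the one the load balancer honors."""
    scope = "private" if private else "public"
    return {"Cache-Control": f"{scope}, max-age={max_age_seconds}"}

def ttl_cache(ttl_seconds):
    """Cache an expensive page-render function's result for ttl_seconds."""
    def decorator(fn):
        cached = {}  # args -> (expires_at, value)
        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = cached.get(args)
            if hit and hit[0] > now:
                return hit[1]  # still fresh: skip the expensive render
            value = fn(*args)
            cached[args] = (now + ttl_seconds, value)
            return value
        return wrapper
    return decorator
```

With a 10-minute TTL (`ttl_cache(600)`) only the first request per window pays the full render cost, which is exactly the short-term relief suggested above.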

Third, you simply need to reduce the number of elements on this page. This may mean paginating the people, generating a single image overlay instead of individual requests, or anything else you can imagine. I have discussed this with a few other folks here in IT, and the general consensus is that this page will need at least some work on the code end if it is ever going to load quickly. This is especially true the more people the page attempts to display.
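The pagination option is straightforward to sketch. In Django you would normally use `django.core.paginator.Paginator`; the slicing logic underneath is just this (the 50-per-page figure is an assumption, not anything the site actually uses):

```python
import math

def paginate(items, page, per_page=50):
    """Return one page of items plus enough metadata to render pager links.

    page is 1-indexed. items is any sequence; in Django it could be a
    queryset, which slices lazily into LIMIT/OFFSET SQL.
    """
    total_pages = max(1, math.ceil(len(items) / per_page))
    page = min(max(1, page), total_pages)  # clamp out-of-range pages
    start = (page - 1) * per_page
    return {
        "items": items[start:start + per_page],
        "page": page,
        "total_pages": total_pages,
    }
```

At 50 people per page, the 289 components above become at most 50 avatar requests per view instead of 289.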

At the end of the day there is little we can do to help given the way this page is designed. I am happy to help in any way I can, but generally the best solutions will come from the code (cache headers, reducing DOM elements, etc.). In the short term, if you just set caching headers to, say, 10 minutes, that will fix the immediate issue and give you time to craft a more elegant solution.

Please let me know if I can be of further assistance.
Assignee: server-ops → jcrowe
(Reporter)

Comment 2

6 years ago
Hey Jason,

Thanks for the detailed reply. It definitely makes sense that the load balancer times out; it takes way too long to render the page on the server.

I believe I've already traced the problem to the way the avatar service (libravatar) works. Due to its federated nature, a server-side DNS query must be completed for each avatar to determine the correct avatar URL. This was causing huge delays, with the server just waiting for DNS responses. I committed a patch which caches these responses, and it's already on dev.
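The shape of that fix is a TTL cache in front of the resolver, so repeated renders don't block on a fresh DNS round-trip per avatar. This is a minimal sketch of the idea, not the actual committed patch; `resolve` stands in for whatever lookup the libravatar client really performs, and the 300-second TTL is an assumed value.

```python
import time

class DNSCache:
    """Cache DNS answers for a fixed TTL so each avatar lookup
    doesn't block the page render on a fresh query."""

    def __init__(self, resolve, ttl=300):
        self._resolve = resolve   # the real resolver callable
        self._ttl = ttl
        self._cache = {}          # hostname -> (expires_at, answer)

    def lookup(self, hostname):
        now = time.monotonic()
        hit = self._cache.get(hostname)
        if hit and hit[0] > now:
            return hit[1]         # still fresh: no network round-trip
        answer = self._resolve(hostname)
        self._cache[hostname] = (now + self._ttl, answer)
        return answer
```

Since federated avatars cluster on a handful of hostnames, most of the 289 lookups on the people page become cache hits after the first render.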

Another patch moves most of the rendering work to the browser, since we now fetch objects through AJAX. It will land in the next few days.

About caching: we already cache this page for about 10 minutes. I have configured this page as "private" though, since it shows your username at the top, so I guess it gets cached per user. Is there something better we can do on this front?

Cheers,
-g
(Assignee)

Comment 3

6 years ago
Giorgos,

About caching.

As for logged-in users (which I am not, so I cannot test): I would suggest removing the username at the top so there can be one cached version, or making the header (or whatever contains the username) a separate page or element so you can cache the remainder of the page. I don't know exactly how this would work; my point is that there is almost zero reason to cache for logged-in users if the page is different for each user. The initial view will never be cached, and after waiting so long (and possibly getting a timeout) I doubt many logged-in users are going to be willing to reload.

As for non-logged-in users (me, for one), the page can easily be cached for quite a long time (an hour or more). It is currently not cached at all (X-Cache-Info: not cacheable; response specified "Cache-Control: private"). If this is changed, it will at least provide a better experience for anonymous users.
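The split described above (long public caching for anonymous visitors, private caching only where the page truly varies per user) could look like the sketch below. The `is_authenticated` flag and both TTLs are assumptions for illustration, not the site's actual values; in Django this decision usually lives in a view decorator or middleware.

```python
def cache_control_for(is_authenticated, anon_max_age=3600, auth_max_age=600):
    """Pick a Cache-Control value per request: long public caching for
    anonymous visitors, short private caching for logged-in users."""
    if is_authenticated:
        # The page embeds the username, so shared caches must not store it.
        return f"private, max-age={auth_max_age}"
    return f"public, max-age={anon_max_age}"
```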

At the end of the day, all of this is really just a band-aid and does not address the real issue. The page is simply architected in a way that is detrimental to performance. Unless the page's design changes, we will never be able to make its performance even reasonable (let alone fast).
(Assignee)

Comment 4

6 years ago
Giorgos,

I am going to close this bug, as there is nothing actionable for WebOps here. Please reopen if you have any specific actions you need from us, or if I can provide any further information or assistance.

Regards
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → WONTFIX
(Reporter)

Comment 5

6 years ago
OK Jason. I'll re-check reps caching. Thanks!
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations