Closed Bug 1129398 Opened 9 years ago Closed 9 years ago

[ReMo][Prod] 503 error in reps.mozilla.org

Categories

(Infrastructure & Operations Graveyard :: WebOps: Engagement, task)

task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: tasos, Assigned: cliang)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/410] )

Attachments

(1 file)

Production server (reps.mozilla.org) randomly generates a 503 error with the following message:

"Service Temporarily Unavailable

The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later."

After a reload, the page loads normally. No traceback is produced on the server, and the error can happen on any URL.
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/410]
Hi,

Is there any update on the cause of this error? These random 503 errors interrupt some services in the portal, resulting in limited functionality.

Thanks!
Bumping this to major since the portal is not working correctly.
Severity: normal → major
This is paging oncall, even though the site is not fully down. Will page someone from webops to have a look.
Severity: major → normal
We are currently blocked from pushing to stage and production (until we know what was or is wrong and whether it has been fixed). Can we have an ETA for that?
The 503 errors now occur on almost every operation you do on the website. The portal is almost unusable. We opened this over a week ago and still have no response. Moving it to major.
Severity: normal → major
Hi,

Does this need eyes on it right this second? If so I can wake someone from webops. Otherwise I'll try and get someone to look at it in a few hours when they wake.

Let me know. (I tried to contact you on IRC)
Assignee: server-ops-webops → rwatson
We can wait a few hours for webops to have a look.

Thanks
Assignee: rwatson → cliang
Attached file reps_errors.txt
Tue Jan 27 2015 06:27:22 GMT-0600 (CST)
Tue Feb 03 2015 07:33:50 GMT-0600 (CST)
Fri Feb 06 2015 11:12:14 GMT-0600 (CST)



I believe that I've fixed the 503 errors after explicitly restarting Apache on two of the webheads.  (I'm not sure why the apache graceful command that is normally part of the deployment process did not fix the 503 issues.)  There are some occasional errors being shown, which I've added as an attachment.  (These errors occurred multiple times and on different servers.)



I popped onto generic1.webapp.phx1 and started to look at the Apache logs for reps.mozilla.org.  These were filled with a large number of errors about being unable to connect to the WSGI daemon process. [1]  This usually implies that the WSGI process has crashed in a way that does not clear out the old socket file, or that incomplete requests were still running when Apache was restarted or the WSGI script file was touched.
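(For reference, one quick way to gauge how often these failures occur is to count the connect errors in the per-vhost logs. This is only a sketch; the log directory is an assumption based on the ssl_error_log_* naming that appears later in this bug.)

# Count mod_wsgi connect failures per log file (log path is an assumption)
cd /var/log/httpd/reps.mozilla.org
grep -c "Unable to connect to WSGI daemon process" ssl_error_log_*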

The main Apache error log showed occasional "URI too long" errors, but not nearly often enough to be the likely culprit.  Popping onto generic6.webapp.phx1 to see if it had anything useful in the error logs, I found that it looked fine (no WSGI daemon process errors).

New Relic showed that generic1 and generic2 were no longer reporting to the app.  After verifying that these two servers were the ones showing errors and that the socket file mentioned in the error message *did* exist, I gracefully restarted Apache on each of them to force a restart of the WSGI processes and the creation of a new socket file.


[1] [Fri Feb 13 14:18:04 2015] [error] [client x.x.x.x] (11)Resource temporarily unavailable: mod_wsgi (pid=17317): Unable to connect to WSGI daemon process 'reps-prod-ssl' on '/var/run/wsgi.1943.7.17.sock'.
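(For the record, the remediation steps were roughly along these lines; the exact commands are an assumption and the init-script name on these hosts may differ.)

# Confirm that the socket file named in the error actually exists
ls -l /var/run/wsgi.1943.7.17.sock

# Graceful restart: lets in-flight requests finish, respawns the mod_wsgi
# daemon processes, and creates a fresh socket file
sudo apachectl graceful     # or: sudo service httpd graceful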
Verified that there have been no 503 errors since Friday.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Unfortunately, I just tested the production environment and it still produces these errors.
Yes, I would say 50% of the time I request a reps.mozilla.org page I get the error.
I am reopening this bug to track this issue, since 503 errors are still present.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I think I've addressed the problem for now by restarting Apache.  I'll try pushing out a potential fix on Monday.

The errors are similar to the ones listed above.  Only generic2 is showing the error and it just started happening today.  The first error occurred at 18:05:28 UTC [1]  

I've noticed that on servers that do not have 503 errors, the reps WSGI daemon has the socket file open only once.  On the server displaying errors, the socket file is open multiple times. [2]  I'm wondering if there is an issue with threading.  I have noticed that the production reps WSGI daemon is not configured with the same explicit process and thread counts as the ones in dev and stage.  Since we've been asked not to make changes to production on Fridays, I'll wait until Monday to push out those configuration changes.


[1] ssl_error_log_2015-02-20-18:[Fri Feb 20 18:05:28 2015] [error] [client 95.134.118.144] (11)Resource temporarily unavailable: mod_wsgi (pid=9081): Unable to connect to WSGI daemon process 'reps-prod-ssl' on '/var/run/wsgi.1919.16.16.sock'., referer: https://reps.mozilla.org/e/salon-primevere-espace-numerique-libre/

[2]  [cliang@generic5.webapp.phx1 reps.mozilla.org]$ sudo lsof -p 14461 | grep sock$
httpd   14461 apache  228u  unix 0xffff8802d974b2c0      0t0 105495589 /var/run/wsgi.2220.15.16.sock

versus 

[cliang@generic2.webapp.phx1 httpd]$ sudo lsof -p 8701 | grep sock$
httpd   8701 apache   11u  unix 0xffff88019b424ac0      0t0 585612877 /var/run/wsgi.1919.16.16.sock
httpd   8701 apache   15u  unix 0xffff8802a1c60c00      0t0 585614717 /var/run/wsgi.1919.16.16.sock
httpd   8701 apache   21u  unix 0xffff8802e8744200      0t0 585633305 /var/run/wsgi.1919.16.16.sock
httpd   8701 apache   23u  unix 0xffff88031d2d14c0      0t0 585641809 /var/run/wsgi.1919.16.16.sock
httpd   8701 apache   26u  unix 0xffff8803396df280      0t0 585651253 /var/run/wsgi.1919.16.16.sock
httpd   8701 apache   29u  unix 0xffff88026b93d140      0t0 585669005 /var/run/wsgi.1919.16.16.sock
httpd   8701 apache   31u  unix 0xffff8802f9e95ac0      0t0 585688399 /var/run/wsgi.1919.16.16.sock
httpd   8701 apache   33u  unix 0xffff8802e3a5d240      0t0 585691177 /var/run/wsgi.1919.16.16.sock
httpd   8701 apache   35u  unix 0xffff880103450080      0t0 585692742 /var/run/wsgi.1919.16.16.sock
httpd   8701 apache   37u  unix 0xffff88031d2d0480      0t0 585708787 /var/run/wsgi.1919.16.16.sock
httpd   8701 apache   39u  unix 0xffff8800a3b31b40      0t0 585719161 /var/run/wsgi.1919.16.16.sock
httpd   8701 apache   41u  unix 0xffff88001cf484c0      0t0 585720886 /var/run/wsgi.1919.16.16.sock
httpd   8701 apache   43u  unix 0xffff88010e912900      0t0 585726591 /var/run/wsgi.1919.16.16.sock
httpd   8701 apache   45u  unix 0xffff880319198c40      0t0 585736306 /var/run/wsgi.1919.16.16.sock
httpd   8701 apache   47u  unix 0xffff88001cf48800      0t0 585739274 /var/run/wsgi.1919.16.16.sock
httpd   8701 apache  228u  unix 0xffff8802f9e94a80      0t0 585607805 /var/run/wsgi.1919.16.16.sock
Pushed out a config change making the number of threads and processes for the reps WSGI daemon explicit.  Holding this bug open to see if this addresses the issue.
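(For context, making these counts explicit for a mod_wsgi daemon group is done via the WSGIDaemonProcess directive. A minimal sketch follows; the actual process/thread values pushed to production are not recorded in this bug, so the numbers below are placeholders.)

# Placeholder values; the real counts match whatever dev/stage already use
WSGIDaemonProcess reps-prod-ssl processes=4 threads=16 display-name=%{GROUP}
WSGIProcessGroup reps-prod-ssl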
It seems that there are no more errors. If the logs on the server are clean too, maybe we can close this bug as resolved.
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard