The production server (reps.mozilla.org) intermittently returns a 503 error with the following message: "Service Temporarily Unavailable. The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later." After a reload, the page loads normally. No traceback is produced on the server, and the error can occur on any URL.
Hi, is there any update on the cause of this error? These random 503 errors interrupt some services in the portal, resulting in limited functionality. Thanks!
Bumping this to major since the portal is not working correctly.
Severity: normal → major
This is paging oncall, while the site isn't actually down. Will page someone from webops to have a look.
Severity: major → normal
We are currently blocked on pushing to stage and production (until we know what was/is wrong and/or fixed). Can we have an ETA for that?
The 503 errors now occur on almost any operation on the website; the portal is nearly unusable. We opened this over a week ago and still have no response. Moving it to major.
Severity: normal → major
Hi, Does this need eyes on it right this second? If so I can wake someone from webops. Otherwise I'll try and get someone to look at it in a few hours when they wake. Let me know. (I tried to contact you on IRC)
Assignee: server-ops-webops → rwatson
We can wait a few hours for webops to have a look. Thanks
Created attachment 8564191 [details] reps_errors.txt

I believe I've fixed the 503 errors by explicitly restarting Apache on two of the webheads. (I'm not sure why the apache graceful command that is normally part of the deployment process did not fix the 503 issues.) There are still some occasional errors being shown, which I've added as an attachment. These errors occurred multiple times and on different servers, at:

Tue Jan 27 2015 06:27:22 GMT-0600 (CST)
Tue Feb 03 2015 07:33:50 GMT-0600 (CST)
Fri Feb 06 2015 11:12:14 GMT-0600 (CST)

I popped onto generic1.webapp.phx1 and started to look at the Apache logs for reps.mozilla.org. These were filled with a large number of errors about being unable to connect to the WSGI daemon process. This usually implies that the WSGI process crashed in a way that did not clear out the old socket file, or that some incomplete requests were still running when Apache was restarted / the WSGI script file was touched. The main Apache error log showed occasional "URI too long" errors, but not nearly often enough to be the likely culprit.

Popping onto generic6.webapp.phx1 to see if it had anything useful in the error logs, I found that it looked fine (no WSGI daemon process errors). New Relic showed that generic1 and generic2 were no longer reporting to the app. After verifying that these two servers were the ones showing errors, and that the socket file mentioned in the error message *did* exist, I gracefully restarted Apache on each of them to force a restart of the WSGI processes and the creation of a new socket file.

[Fri Feb 13 14:18:04 2015] [error] [client x.x.x.x] (11)Resource temporarily unavailable: mod_wsgi (pid=17317): Unable to connect to WSGI daemon process 'reps-prod-ssl' on '/var/run/wsgi.1943.7.17.sock'.
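For reference, this failure signature can be filtered out of the Apache error logs with a simple grep. This is only a sketch: the sample line below is copied from the report and stands in for the real error log, whose exact path on these webheads is not confirmed here.

```shell
# Sample mod_wsgi failure line from this bug, standing in for the real
# Apache error log (on a live webhead you would grep the log file itself).
sample_log="[Fri Feb 13 14:18:04 2015] [error] [client x.x.x.x] (11)Resource temporarily unavailable: mod_wsgi (pid=17317): Unable to connect to WSGI daemon process 'reps-prod-ssl' on '/var/run/wsgi.1943.7.17.sock'."

# Count occurrences of the daemon-connect failure signature.
printf '%s\n' "$sample_log" | grep -c "Unable to connect to WSGI daemon process"
```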
Verified that there have been no 503 errors since Friday.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Unfortunately, I just tested the production environment and it still produces these errors.
Yes, I would say 50% of the time I request a reps.mozilla.org page I get the error.
I am reopening this bug to track this issue, since 503 errors are still present.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I think I've temporarily addressed the problem for now by restarting Apache. I'll try pushing out a potential fix on Monday.

The errors are similar to the ones listed above. Only generic2 is showing the error, and it just started happening today; the first error occurred at 18:05:28 UTC.

I've noticed that on servers that do not show 503 errors, the reps WSGI daemon has the socket file open only once. On the server displaying errors, the socket file is open multiple times. I'm wondering if there is an issue with threading: the production reps WSGI daemon is not configured with the same explicit list of processes and threads as the ones in dev and stage. Since we've been asked not to make changes to production on Fridays, I'll wait until Monday to push out those configuration changes.

ssl_error_log_2015-02-20-18: [Fri Feb 20 18:05:28 2015] [error] [client 188.8.131.52] (11)Resource temporarily unavailable: mod_wsgi (pid=9081): Unable to connect to WSGI daemon process 'reps-prod-ssl' on '/var/run/wsgi.1919.16.16.sock'., referer: https://reps.mozilla.org/e/salon-primevere-espace-numerique-libre/

[firstname.lastname@example.org reps.mozilla.org]$ sudo lsof -p 14461 | grep sock$
httpd 14461 apache 228u unix 0xffff8802d974b2c0 0t0 105495589 /var/run/wsgi.2220.15.16.sock

versus

[email@example.com httpd]$ sudo lsof -p 8701 | grep sock$
httpd 8701 apache 11u unix 0xffff88019b424ac0 0t0 585612877 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache 15u unix 0xffff8802a1c60c00 0t0 585614717 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache 21u unix 0xffff8802e8744200 0t0 585633305 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache 23u unix 0xffff88031d2d14c0 0t0 585641809 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache 26u unix 0xffff8803396df280 0t0 585651253 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache 29u unix 0xffff88026b93d140 0t0 585669005 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache 31u unix 0xffff8802f9e95ac0 0t0 585688399 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache 33u unix 0xffff8802e3a5d240 0t0 585691177 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache 35u unix 0xffff880103450080 0t0 585692742 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache 37u unix 0xffff88031d2d0480 0t0 585708787 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache 39u unix 0xffff8800a3b31b40 0t0 585719161 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache 41u unix 0xffff88001cf484c0 0t0 585720886 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache 43u unix 0xffff88010e912900 0t0 585726591 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache 45u unix 0xffff880319198c40 0t0 585736306 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache 47u unix 0xffff88001cf48800 0t0 585739274 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache 228u unix 0xffff8802f9e94a80 0t0 585607805 /var/run/wsgi.1919.16.16.sock
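The single-open versus multiple-open symptom can be summarized by counting how often each socket path appears in the lsof output for a given worker. A minimal sketch, using a few sample lines from the listing above in place of live lsof output:

```shell
# Three sample lines from the lsof listing above; on a live webhead you
# would pipe `sudo lsof -p <pid> | grep 'sock$'` in instead.
lsof_sample="httpd 8701 apache 11u unix 0xffff88019b424ac0 0t0 585612877 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache 15u unix 0xffff8802a1c60c00 0t0 585614717 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache 21u unix 0xffff8802e8744200 0t0 585633305 /var/run/wsgi.1919.16.16.sock"

# Count open descriptors per socket path ($NF is the last field, the path).
# A healthy worker shows a count of 1; the misbehaving worker shows one
# entry per leaked descriptor.
printf '%s\n' "$lsof_sample" | awk '{count[$NF]++} END {for (s in count) print count[s], s}'
```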
Pushed out a config change making the number of threads and processes for the reps WSGI daemon explicit. Holding this bug open to see if this addresses the issue.
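The change described is presumably along these lines in the mod_wsgi virtual-host config. This is a hypothetical sketch only: the daemon group name matches the error logs, but the directive values are illustrative, since the actual numbers deployed were not given in this bug.

```apache
# Illustrative only: pin the daemon's process/thread counts explicitly,
# mirroring dev/stage, rather than relying on mod_wsgi defaults.
WSGIDaemonProcess reps-prod-ssl processes=4 threads=15 display-name=%{GROUP}
WSGIProcessGroup reps-prod-ssl
```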
It seems there are no more errors. If the logs on the server are clean too, maybe we can close this bug as resolved.
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago → 3 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard