Closed
Bug 1129398
Opened 9 years ago
Closed 9 years ago
[ReMo][Prod] 503 error in reps.mozilla.org
Categories
(Infrastructure & Operations Graveyard :: WebOps: Engagement, task)
Infrastructure & Operations Graveyard
WebOps: Engagement
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: tasos, Assigned: cliang)
Details
(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/410] )
Attachments
(1 file)
1020 bytes, text/plain
Production server (reps.mozilla.org) randomly generates a 503 error with the following message: "Service Temporarily Unavailable The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later." After a reload, the page loads normally. No traceback is produced on the server and the error can happen to any url.
Reporter
Comment 1•9 years ago
Hi, is there any update on the cause of this error? These random 503 errors interrupt some services in the portal, resulting in limited functionality. Thanks!
Reporter
Comment 2•9 years ago
Bumping this to major since the portal is not working correctly.
Severity: normal → major
Comment 3•9 years ago
This is paging oncall, even though the site is not actually down. Will page someone from webops to have a look.
Severity: major → normal
Comment 4•9 years ago
We are currently blocked on pushing to stage and production (until we know what was/is wrong and/or fixed). Can we have an ETA for that?
Comment 5•9 years ago
The 503 errors now occur on almost every operation on the website, and the portal is nearly unusable. We opened this bug over a week ago and have still had no response. Moving it to major.
Severity: normal → major
Comment 6•9 years ago
Hi, does this need eyes on it right this second? If so, I can wake someone from webops. Otherwise, I'll try to get someone to look at it in a few hours when they wake. Let me know. (I tried to contact you on IRC.)
Assignee: server-ops-webops → rwatson
Comment 7•9 years ago
We can wait a few hours for webops to have a look. Thanks!
Assignee
Updated•9 years ago
Assignee: rwatson → cliang
Assignee
Comment 8•9 years ago
Tue Jan 27 2015 06:27:22 GMT-0600 (CST)
Tue Feb 03 2015 07:33:50 GMT-0600 (CST)
Fri Feb 06 2015 11:12:14 GMT-0600 (CST)

I believe that I've fixed the 503 errors after explicitly restarting Apache on two of the webheads. (I'm not sure why the "apache graceful" command that is normally part of the deployment process did not fix the 503 issues.) There are still some occasional errors, which I've added as an attachment. (These errors occurred multiple times and on different servers.)

I popped onto generic1.webapp.phx1 and started to look at the Apache logs for reps.mozilla.org. These were filled with a large number of errors about being unable to connect to the WSGI daemon process. [1] This usually implies either that the WSGI process crashed in some way that did not clear out the old socket file, or that incomplete requests were still running when Apache was restarted / the WSGI script file was touched. The main Apache error log showed occasional "URI too long" errors, but not nearly often enough to be the likely culprit.

Popping onto generic6.webapp.phx1 to see if it had anything useful in the error logs, I found that it looked fine (no WSGI daemon process errors). New Relic showed that generic1 and generic2 were no longer reporting to the app. After verifying that these two servers were the ones showing errors and that the socket file mentioned in the error message *did* exist, I gracefully restarted Apache on each of them to force a restart of the WSGI processes and the creation of a new socket file.

[1] [Fri Feb 13 14:18:04 2015] [error] [client x.x.x.x] (11)Resource temporarily unavailable: mod_wsgi (pid=17317): Unable to connect to WSGI daemon process 'reps-prod-ssl' on '/var/run/wsgi.1943.7.17.sock'.
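The check-then-restart procedure described in comment 8 can be sketched as a small shell helper. The helper name and log path are hypothetical (the bug doesn't show the exact commands run); the grep pattern matches the mod_wsgi error quoted in [1].

```shell
# Count mod_wsgi daemon-connection failures in an Apache error log.
# A burst of these on a single webhead is what surfaced to users as 503s.
# count_wsgi_errors is a hypothetical helper, not a tool from the deploy process.
count_wsgi_errors() {
    grep -c "Unable to connect to WSGI daemon process" "$1"
}

# If the count is nonzero, a graceful restart forces mod_wsgi to spawn
# fresh daemon processes and recreate the socket file:
#   sudo apachectl graceful
```

For example: `count_wsgi_errors /var/log/httpd/reps.mozilla.org/error_log` (the log path will differ per host).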
Assignee
Comment 9•9 years ago
Verified that there have been no 503 errors since Friday.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Reporter
Comment 10•9 years ago
Unfortunately, I just tested the production environment and it still produces these errors.
Comment 11•9 years ago
Yes, I would say 50% of the time I request a reps.mozilla.org page, I get the error.
Reporter
Comment 12•9 years ago
I am reopening this bug to track this issue, since 503 errors are still present.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 13•9 years ago
I think I've temporarily addressed the problem by restarting Apache; I'll try pushing out a potential fix on Monday.

The errors are similar to the ones listed above. Only generic2 is showing the error, and it just started happening today; the first error occurred at 18:05:28 UTC. [1]

I've noticed that on servers that do not have 503 errors, the reps WSGI daemon has the socket file open only once. On the server displaying errors, the socket file is open multiple times. [2] I'm wondering if there is an issue with threading: the production reps WSGI daemon is not configured with the same explicit number of processes and threads as the ones in dev and stage. Since we've been asked not to make changes to production on Fridays, I'll wait until Monday to push out those configuration changes.

[1] ssl_error_log_2015-02-20-18:[Fri Feb 20 18:05:28 2015] [error] [client 95.134.118.144] (11)Resource temporarily unavailable: mod_wsgi (pid=9081): Unable to connect to WSGI daemon process 'reps-prod-ssl' on '/var/run/wsgi.1919.16.16.sock'., referer: https://reps.mozilla.org/e/salon-primevere-espace-numerique-libre/

[2] [cliang@generic5.webapp.phx1 reps.mozilla.org]$ sudo lsof -p 14461 | grep sock$
httpd 14461 apache 228u unix 0xffff8802d974b2c0 0t0 105495589 /var/run/wsgi.2220.15.16.sock

versus

[cliang@generic2.webapp.phx1 httpd]$ sudo lsof -p 8701 | grep sock$
httpd 8701 apache  11u unix 0xffff88019b424ac0 0t0 585612877 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache  15u unix 0xffff8802a1c60c00 0t0 585614717 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache  21u unix 0xffff8802e8744200 0t0 585633305 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache  23u unix 0xffff88031d2d14c0 0t0 585641809 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache  26u unix 0xffff8803396df280 0t0 585651253 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache  29u unix 0xffff88026b93d140 0t0 585669005 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache  31u unix 0xffff8802f9e95ac0 0t0 585688399 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache  33u unix 0xffff8802e3a5d240 0t0 585691177 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache  35u unix 0xffff880103450080 0t0 585692742 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache  37u unix 0xffff88031d2d0480 0t0 585708787 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache  39u unix 0xffff8800a3b31b40 0t0 585719161 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache  41u unix 0xffff88001cf484c0 0t0 585720886 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache  43u unix 0xffff88010e912900 0t0 585726591 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache  45u unix 0xffff880319198c40 0t0 585736306 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache  47u unix 0xffff88001cf48800 0t0 585739274 /var/run/wsgi.1919.16.16.sock
httpd 8701 apache 228u unix 0xffff8802f9e94a80 0t0 585607805 /var/run/wsgi.1919.16.16.sock
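The manual lsof comparison in [2] can be condensed into a one-liner that tallies how many descriptors a process holds on each unix socket. The helper name is my own invention; the behavior just automates the inspection above.

```shell
# Given `lsof -p <pid>` output on stdin, print the number of open file
# descriptors per unix-socket path. A healthy webhead showed the reps
# WSGI socket open once; the failing generic2 held it open many times.
count_socket_fds() {
    awk '/sock$/ { n[$NF]++ } END { for (s in n) print n[s], s }'
}
```

Typical use: `sudo lsof -p 8701 | count_socket_fds`.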
Assignee
Comment 14•9 years ago
Pushed out a config change making the number of threads and processes for the reps WSGI daemon explicit. Holding this bug open to see if this addresses the issue.
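The actual configuration pushed in comment 14 is not shown in the bug; as a sketch, an explicit mod_wsgi daemon definition looks like the following. The process and thread counts here are illustrative assumptions, not the values actually deployed.

```apache
# Illustrative only: real reps-prod values are not visible in this bug.
# Without explicit processes=/threads=, mod_wsgi runs a single
# multithreaded daemon process, which can differ from dev/stage behavior.
WSGIDaemonProcess reps-prod-ssl processes=4 threads=16 display-name=%{GROUP}
WSGIProcessGroup reps-prod-ssl
```

Pinning these values makes production match the explicit configuration that dev and stage already had.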
Reporter
Comment 15•9 years ago
It seems there are no more errors. If the logs on the server are clean too, maybe we can close this bug as resolved.
Updated•9 years ago
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → FIXED
Updated•8 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard