Generic6 throwing 503 errors on careers.mozilla.org

RESOLVED FIXED

Status

Infrastructure & Operations Graveyard
WebOps: Engagement
--
critical
RESOLVED FIXED
3 years ago
2 years ago

People

(Reporter: mkelly, Assigned: cyliang)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/591] )

(Reporter)

Description

3 years ago
We're seeing sporadic 503 errors on careers.mozilla.org. Specifically, I'm seeing them whenever "generic6.webapp.phx1.mozilla.com" is the node serving the site.
(Reporter)

Comment 1

3 years ago
And by sporadic I mean "once every 6 or 7 pageloads".

Updated

3 years ago
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/591]
Adding for the timeline: FWIW, our monitoring caught this on Monday, but it recovered on the next check. I didn't see a need to dig further since the website loaded just fine...
(Assignee)

Comment 3

3 years ago
These errors may be appearing due to the use of thread-unsafe code somewhere within careers.mozilla.org. I've made an Apache config settings change which I hope will fix the issue.

In the error log for careers.mozilla.org on generic6, I found errors like this:

[Tue Feb 17 23:03:43 2015] [error] [client 207.126.102.129] (11)Resource temporarily unavailable: mod_wsgi (pid=27597): Unable to connect to WSGI daemon process 'careers' on '/var/run/wsgi.2136.8.7.sock'.

... interspersed with plenty of entries that look normal. I consulted with Jake, who suggested that this may be an issue with multi-threaded code. This is bolstered by the fact that the staging site for careers.mozilla.org explicitly limits the WSGI daemon to 8 processes and only 1 thread.
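
For reference, a limit like the one described for staging would typically be expressed with mod_wsgi's WSGIDaemonProcess directive. This is only a sketch: the daemon group name 'careers' is taken from the error message above, and the real settings file is not shown in this bug, so any other options it carries are unknown.

```apache
# Hypothetical sketch, not the actual production file.
# Run the 'careers' daemon group as 8 single-threaded processes,
# so no two requests ever share a Python interpreter concurrently.
WSGIDaemonProcess careers processes=8 threads=1
WSGIProcessGroup careers
```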

I've transferred these properties to the production settings file and pushed out the changes. We may need to monitor for a while to see if these errors crop up again.
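
One way to do that monitoring is to count the mod_wsgi connect failures per day in the Apache error log. The following is a minimal sketch, assuming log lines follow the same layout as the sample quoted above; the function name and the regular expression are illustrative and not part of any existing tooling.

```python
import re
from collections import Counter

# Matches the mod_wsgi failure quoted above; the timestamp layout is
# assumed to follow that single sample line.
PATTERN = re.compile(
    r"^\[\w{3} (?P<mon>\w{3}) +(?P<day>\d{1,2}) [\d:]+ (?P<year>\d{4})\] \[error\] "
    r".*Unable to connect to WSGI daemon process '(?P<group>[^']+)'"
)


def count_daemon_connect_errors(lines):
    """Return a Counter keyed by (date string, daemon group name)."""
    counts = Counter()
    for line in lines:
        m = PATTERN.match(line)
        if m:
            date = f"{m['year']} {m['mon']} {int(m['day']):02d}"
            counts[(date, m['group'])] += 1
    return counts
```

Passing an open error log file to count_daemon_connect_errors() gives a per-day tally; if the config change worked, no new dates should appear for the 'careers' group after the push.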

Updated

3 years ago
Assignee: server-ops-webops → cliang
(Reporter)

Comment 4

3 years ago
(In reply to C. Liang [:cyliang] from comment #3)

Thanks! :D

Just to clarify, you mean some part of the Careers Python app could be thread-unsafe, right? I skimmed the code and didn't see anything sticking out, but it might be more subtle than that. Careers doesn't use any libraries that aren't being successfully used by our other Django sites, either.

Would you recommend trying to track this down, or are your config changes good enough as a final solution?

Comment 5

3 years ago
We've run into this sort of thing on several of our properties, and this is the typical fix... so much so that normally we just default to it these days. :)

FWIW, I suspect it's nothing in the careers code specifically, but rather something in an underlying library... something in vendor, maybe. We have never fully tracked it down.

I'm fine with just stopping here. If the problem does come back, we can dig more.
(Assignee)

Comment 6

3 years ago
No 503 errors for careers.mozilla.org in the past two weeks.  =)

I'm closing out this bug; please open a new bug and refer back to this one if the problem crops up again.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard