Try to enforce a timeout on each individual request

VERIFIED FIXED

Status

P2
normal
VERIFIED FIXED
5 years ago
4 years ago

People

(Reporter: rfkelly, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [qa+])

Attachments

(2 attachments)

Created attachment 757207 [details] [diff] [review]
patch to timeout individual requests in the gunicorn worker

The standard gunicorn --timeout argument doesn't work like most people expect.  It's not a timeout for each individual request, but rather for the worker process as a while - as long as *some* activity is taking place inside the worker, the timeout will not be triggered.

This patch extends our custom gunicorn worker subclass to do the expected thing, and apply the timeout to each individual request.

There's some caveats to using gevent.Timeout() for this.  In particular, the timeout won't trigger until you yield to the event loop, so it can't interrupt CPU-bound tasks.  But it does the right thing in most cases.

There's also a bug in PyMySQL that prevents this from working quite as well as I'd like, see https://github.com/petehunt/PyMySQL/pull/148

Caveats aside, I've locally tested that this works in the case we're most interested in right now - if the timeout expires, and we kill the running MySQL query, then the request handler will error out rather than retrying the query.
Attachment #757207 - Flags: review?(telliott)
QA Contact: jbonacci
Whiteboard: [qa+]
Comment on attachment 757207 [details] [diff] [review]
patch to timeout individual requests in the gunicorn worker

Fingers crossed. I dislike that we're trying to solve problems with all the layers of wrapping by adding another wrapper class, but recognize that it may be necessary :P
Attachment #757207 - Flags: review?(telliott) → review+
(In reply to Toby Elliott [:telliott] from comment #1)
> Comment on attachment 757207 [details] [diff] [review]
> patch to timeout individual requests in the gunicorn worker
> 
> Fingers crossed. I dislike that we're trying to solve problems with all the
> layers of wrapping by adding another wrapper class

Indeed.  Although I like to think of this as tweaking an existing layer so it works a little better, rather than adding more layers to the system :-)

Committed in http://hg.mozilla.org/services/server-core/rev/ab0cff6c4c17
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Re-opening because we need to get this ported over to the sync1.5 stack.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Created attachment 8386560 [details]
gunicorn_worker.py

Here's the same worker code, with license header changed and renamed to "MozSvcGeventWorker" to be slightly more clear what it's all about.  Let's put it in mozsvc and we can try it for our next deploy.
Attachment #8386560 - Flags: review?(telliott)
Blocks: 907475
No longer blocks: 907479
Blocks: 907479
No longer blocks: 907475
Priority: -- → P2
To clarify: there's no code change in this latest review request, just header stuff? It's not a diff, so I want to be sure.
I changed the name from `MozSvcWorker` to `MozSvcGeventWorker` for a little more clarity, but otherwise no, on code change from the sync1.1 version.
Attachment #8386560 - Flags: review?(telliott) → review+
Committed in https://github.com/mozilla-services/mozservices/commit/8850a298c1352d9e91bd146305fd38edd4374f35

Now we need to make sure it's actually used in our production config - :jbonacci watch out for this one in the next TS/Sync deploy ticket.
Status: REOPENED → RESOLVED
Last Resolved: 5 years ago5 years ago
Resolution: --- → FIXED
We can verify Prod logs I guess.
No way to directly test Prod anymore...
:bobm can you help me verify this in Prod?
I am going to assume we are all good here, otherwise we would know by now (multiple pushes to Prod)...
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.