Bug 786176 (Closed): Opened 12 years ago, Closed 12 years ago

Custom gunicorn worker with support for gevent-blocking-detection

Categories: Cloud Services :: Server: Core (defect)
Platform: x86 Linux
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: VERIFIED FIXED

People: rfkelly (Reporter), Unassigned

Whiteboard: [qa+]
Attachments: 1 file

This bug further refines the blocking-detection from Bug 781451.

In this iteration, we define a custom gunicorn worker class that subclasses GeventWorker and adds active blocking detection. Like Bug 781451, it uses the greenlet.settrace() function to monitor switching between greenlets. Unlike Bug 781451, it uses a dedicated background thread to monitor for blocking and report it when it occurs.

The advantage of this approach is that we get a traceback for the code at the point where it's actually blocking, not for the code at the point where it finally yields back to the event loop.

It also avoids icky import-time side-effects and makes it easy to switch detection on or off simply by changing the gunicorn worker class.
Attachment #655904 - Flags: review?(rmiller)
Whiteboard: [qa+]
Comment on attachment 655904:
patch adding a blocking-detecting gunicorn worker

Review of attachment 655904:
-----------------------------------------------------------------

This looks great, other than the possible typo I highlighted.

::: services/gunicorn_worker.py
@@ +52,5 @@
> +
> +import greenlet
> +import gevent.hub
> +
> +from gunicorn.workers.ggevent import GeventWorker

typo?
Attachment #655904 - Flags: review?(rmiller) → review+
(In reply to Rob Miller [:RaFromBRC :rmiller] from comment #1)
> > +from gunicorn.workers.ggevent import GeventWorker
> 
> typo?

No, just an unfortunately-named submodule "ggevent" :-)

Committed on release branch in http://hg.mozilla.org/services/server-core/rev/2fd3dd9988f5

I will leave this bug open until we finish the release and it is merged back to the default branch.
Initial deployment to stage started off well: it reported blocking in pylibmc (which was expected) but also in ldap (which was unexpected).

However, after a few minutes it stopped producing output. I suspect the monitoring thread may have died silently, since everything else in the application seems to be working fine.

So, a good start :-)
We discovered that the stage servers were using an old version of metlog, with a bug that could cause this thread to fail out while logging. After updating metlog, I observed the monitoring thread running reliably for quite some time.

So, closing this out. There are many improvements we could make to the worker, but they can be handled in separate bugs.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
OK. Marking as Verified.
Status: RESOLVED → VERIFIED