Closed Bug 781451 Opened 12 years ago Closed 12 years ago

Log greenlets that block the event loop for too long

Categories

(Cloud Services :: Server: Core, defect)

x86
Linux
defect
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Assigned: rfkelly)

Details

(Whiteboard: [qa+])

Attachments

(3 files)

This patch uses the new trace functionality in greenlet==0.4.0 to implement a very simple blocking-greenlets detector. It sets a function to be called on every greenlet switch, and just keeps track of the time beteen each switch. If too much time has elapsed, then it logs an error with a stack trace of the offending greenlet. I've implemented it against server-core so that it would be enabled by default on all products.
Attachment #650462 - Flags: review?(rmiller)
Also need to update deps to greenlet 0.4
Attachment #650463 - Flags: review?(rmiller)
I should note that this seems to work correctly in my local tests, and the logic seems sound to me, but I won't be terribly surprised if some edge-case means it gives false positives on stage. If it does so, the checking can be disabled by setting the GEVENT_MAX_BLOCKING_TIME environment variable to zero.
Comment on attachment 650462 [details] [diff] [review] helper to detect blocking calls in gevent Review of attachment 650462 [details] [diff] [review]: ----------------------------------------------------------------- This is a bit more import-time-side-effect'y than I usually like, but as it's server-core it's probably not worth a lot of effort to wire it up to the app initialization code.
Attachment #650462 - Flags: review?(rmiller) → review+
Attachment #650463 - Flags: review?(rmiller) → review+
(In reply to Rob Miller [:RaFromBRC :rmiller] from comment #3) > Comment on attachment 650462 [details] [diff] [review] > helper to detect blocking calls in gevent > > Review of attachment 650462 [details] [diff] [review]: > ----------------------------------------------------------------- > > This is a bit more import-time-side-effect'y than I usually like, but as > it's server-core it's probably not worth a lot of effort to wire it up to > the app initialization code. Yeah, it is a bit icky like that, but no more so than the import-time gevent monkey-patching we do elsewhere IMO. One other option is to put things like this into a custom gunicorn worker class, so they are only apply at execution time to the processes that need them. This was committed in http://hg.mozilla.org/services/server-core/rev/8284b086fbd5, I am leaving the bug open while we see if it actually works...
Whiteboard: [qa+]
After trying this out on stage, I've made some adjustments to how it works. First, I have removed the traceback from the original error logs. The error is only logged when the code eventually yields back to gevent, which probably happens in some unrelated piece of well-behaved code. The traceback was thus rather useless in tracking down the problem. To go with this, I've added a check_for_gevent_blocking() context manager. This can be used to determine whether a particular piece of code actually yields to gevent, and log an error if it does not. The error message from this one includes a traceback because it will actually point to somewhere near the code that is at fault.
Assignee: nobody → rfkelly
Attachment #650756 - Flags: review?(rmiller)
Attachment #650756 - Flags: review?(rmiller) → review+
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Verified changes in code.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: