Closed Bug 972857 Opened 11 years ago Closed 11 years ago

the return of the hanging stackwalker

Categories

(Socorro :: Backend, task)

Hardware: x86_64
OS: Linux
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lars, Assigned: lars)

Details

(Whiteboard: [config])

Attachments

(1 file, 1 obsolete file)

When we moved to the new 'stackwalker', the problem of processor threads hanging suddenly went away. The processors shut down properly and there were no longer any zombie stackwalkers.

The problem has returned, but in a slightly trickier form. It happens only on certain machines, not others. On staging, a certain crash will always cause the stackwalker to hang on processor1, but the same crash processed on processor2 doesn't hang. However, this is a heisenbug: the next day, that crash succeeds on both processors.

I've got a version of the processor with a "nuclear" option for killing hanging stackwalkers, but I don't want to use it. It is tricky, opaque, and depends on multithreading, so it won't work if we decide to move to single-threaded processors. I have a different solution in mind. Rather than building out all the detect/kill threading and/or employing the Python self-pipe trick, I want to limit the run time of the stackwalker in a different way.

The stackwalker is invoked in a subprocess shell. We do that so the shell can capture the chatty stderr output of the stackwalker and redirect it to /dev/null, which avoids another known deadlocking problem common in Python subprocess management. Since we're already using a shell, I propose using a shell script to limit the runtime of the stackwalker. This sort of change can happen through configuration alone, without having to change the processor code.

    stackwalker --pipe ....

becomes

    limiter 5 stackwalker --pipe ....

limiter is a shell script that will kill whatever command it is given on the command line if that command isn't finished within the given number of seconds. I've run this on staging processor1 for a couple of hours and it appears to be working.
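
Combined with the stderr redirect mentioned above, the configured command line would presumably end up looking something like this (illustrative placeholders, not the exact production configuration):

    limiter 5 stackwalker --pipe .... 2>/dev/null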

I'll attach the script for vetting.
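
The script itself isn't reproduced here; as a rough idea of the shape being described, a minimal sketch of such a wrapper might look like the following. This is an assumption about the approach, not the actual attachment: it backgrounds the wrapped command, arms a watchdog that SIGKILLs it after the given number of seconds, and otherwise passes the command's exit status through.

    #!/bin/sh
    # sketch only -- not the attached script
    # usage: limiter SECONDS COMMAND [ARGS...]
    seconds="$1"
    shift

    "$@" &                                  # run the wrapped command in the background
    cmd_pid=$!

    (
        sleep "$seconds"                    # wait out the allowed run time
        kill -KILL "$cmd_pid" 2>/dev/null   # kill the command if it is still running
    ) &
    watchdog_pid=$!

    wait "$cmd_pid"                         # block until the command exits (or is killed)
    status=$?
    kill "$watchdog_pid" 2>/dev/null        # cancel the watchdog if it is still sleeping
    exit "$status"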
tmary points out that coreutils already provides "timeout" to accomplish this task. I withdraw the attached script proposal and suggest we use timeout instead.

    stackwalker --pipe ....

becomes

    timeout -s KILL 5 stackwalker --pipe ....
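
If we later want the processor to notice and log these kills, the exit status is the natural hook. A hedged sketch, assuming coreutils' convention of reporting a command killed via KILL as 128 + 9 = 137:

    # sketch: run the stackwalker under timeout and log when it gets killed;
    # the 137 check is an assumption about how `timeout -s KILL` reports the kill
    timeout -s KILL 5 stackwalker --pipe .... 2>/dev/null
    if [ "$?" -eq 137 ]; then
        echo "stackwalker killed by timeout" >&2
    fi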
+1 to tmary's suggestion. I'm gonna go spend an hour with http://www.gnu.org/software/coreutils/ now because I didn't know about timeout.
If you're happy with it on stage, I'm r+ for putting it on prod.
I've added this to the staging config in svn.  We'll run it over the long weekend and assess its value for adding to prod next week.
The timeout ran great in staging for days. I set the timeout to 10 seconds. In the interval between 2/16 and 2/19, the two processors SIGKILL'd the stackwalker 314 times. That indicates that sometimes the stackwalker just takes longer than 10 seconds. No stackwalker zombies were created during that time, and no threads hung forever waiting for the stackwalker. I think this is a win.

What are the ramifications of setting the timeout to 15 or 20 seconds? That might make shutdown take too long if we happen to issue a shutdown request during a long-running stackwalker.
Whiteboard: [config]
Attached is a list of crash_ids from staging for which the stackwalker was killed for taking too long. The kill threshold was 10s. These uuids should be valid on both prod and staging. Looking through them informally, I'm not spotting any common traits.
Attachment #8376284 - Attachment is obsolete: true
Attachment #8378357 - Flags: feedback?(ted)
btw, the 166 killed stackwalkers in the attachment are in contrast with 314061 total invocations of the stackwalker. That's about 0.05% of all crashes. Is that enough to be concerned about, or is it just noise?
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/0e69aca0138a05b794133d1684877011176c3d63
Merge pull request #1901 from twobraids/log-kill-by-timeout

Fixes Bug 972857 - added timeout to stackwalker invocation
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: 75 → 78
Comment on attachment 8378357 [details]
a list of crash_ids for which stackwalker took 10s or longer

The only thing that jumps out at me after looking at the first ~10 of these is that they're mostly Android crashes, with a few Windows Beta crashes thrown in. I wonder if these are mostly just "crashes from non-Windows-Release-Firefox", and what you're actually seeing here is "crashes from less-popular products don't have their symbol files cached in memory so they take longer to read the files off of NFS".

Can you query for crashes where we actually had to kill the stackwalker? I'd be interested to see if it's the same pattern but worse, or if it's something else there.

I'll also note that I found the stackwalker getting killed in my Vagrant setup pretty frequently, and this PR fixed that:
https://github.com/mozilla/socorro/pull/2153

So it's possible that might have a positive impact here.
Attachment #8378357 - Flags: feedback?(ted)
Oh wait, those actually *did* get killed, I missed where you lowered the threshold. I'd be interested to see if the percentage changed since that PR went live. Also, did any of those get re-processed after failing? All of the ones I looked at seem to have processed successfully.