Closed Bug 972857 Opened 11 years ago Closed 11 years ago

the return of the hanging stackwalker

Categories

(Socorro :: Backend, task)

Hardware: x86_64
OS: Linux
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lars, Assigned: lars)

Details

(Whiteboard: [config])

Attachments

(1 file, 1 obsolete file)

When we moved to the new 'stackwalker', the problem of processor threads hanging suddenly went away. The processors shut down properly and there were no longer any zombie stackwalkers.

The problem has returned, but in a slightly trickier form. It happens only on certain machines, not others. On staging, a certain crash will always cause the stackwalker to hang on processor1, but the same crash processed on processor2 doesn't hang. However, this is a heisenbug: the next day, that crash succeeds on both processors.

I've got a version of the processor with a "nuclear" option for killing hanging stackwalkers, but I don't want to use it. It is tricky, opaque, and depends on multithreading, so it won't work if we decide to move to single-threaded processors. I have a different solution in mind. Rather than building out all the detect/kill threading and/or employing the Python self-pipe trick, I want to limit the run time of the stackwalker in a different way.

The stackwalker is invoked in a subprocess shell. We do that so the shell can capture the chatty stderr output of the stackwalker and redirect it to /dev/null, which avoids another known deadlocking problem common in Python subprocess management. Since we're already using a shell, I propose using a shell script to limit the runtime of the stackwalker. This sort of change can happen through configuration alone, without having to change the processor code.

    stackwalker --pipe ....

becomes

    limiter 5 stackwalker --pipe ....

limiter is a shell script that will kill whatever command it is given on the command line if that command isn't finished within the given number of seconds. I've run this on staging processor1 for a couple of hours and it appears to be working.
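
Combined with the stderr redirect mentioned above, the configured command line would presumably end up looking something like this (illustrative placeholders, not the exact production configuration):

    limiter 5 stackwalker --pipe .... 2>/dev/null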

I'll attach the script for vetting.
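
The script itself isn't reproduced here; as a rough idea of the shape being described, a minimal sketch of such a wrapper might look like the following. This is an assumption about the approach, not the actual attachment: it backgrounds the wrapped command, arms a watchdog that SIGKILLs it after the given number of seconds, and otherwise passes the command's exit status through.

    #!/bin/sh
    # sketch only -- not the attached script
    # usage: limiter SECONDS COMMAND [ARGS...]
    seconds="$1"
    shift

    "$@" &                                  # run the wrapped command in the background
    cmd_pid=$!

    (
        sleep "$seconds"                    # wait out the allowed run time
        kill -KILL "$cmd_pid" 2>/dev/null   # kill the command if it is still running
    ) &
    watchdog_pid=$!

    wait "$cmd_pid"                         # block until the command exits (or is killed)
    status=$?
    kill "$watchdog_pid" 2>/dev/null        # cancel the watchdog if it is still sleeping
    exit "$status"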
tmary points out that coreutils already provides "timeout" to accomplish this task. I withdraw the attached script proposal and suggest we use timeout instead.

    stackwalker --pipe ....

becomes

    timeout -s KILL 5 stackwalker --pipe ....
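
If we later want the processor to notice and log these kills, the exit status is the natural hook. A hedged sketch, assuming coreutils' convention of reporting a command killed via KILL as 128 + 9 = 137:

    # sketch: run the stackwalker under timeout and log when it gets killed;
    # the 137 check is an assumption about how `timeout -s KILL` reports the kill
    timeout -s KILL 5 stackwalker --pipe .... 2>/dev/null
    if [ "$?" -eq 137 ]; then
        echo "stackwalker killed by timeout" >&2
    fi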
+1 to tmary's suggestion. I'm gonna go spend an hour with http://www.gnu.org/software/coreutils/ now because I didn't know about timeout.
If you're happy with it on stage, I'm r+ for putting it on prod.
I've added this to the staging config in svn.  We'll run it over the long weekend and assess its value for adding to prod next week.
The timeout ran great in staging for days. I set the timeout to 10 seconds. In the interval between 2/16 and 2/19, the two processors SIGKILL'd the stackwalker 314 times. That indicates that sometimes the stackwalker just takes longer than 10 seconds. No stackwalker zombies were created during that time, and no threads hung forever waiting for the stackwalker. I think this is a win.

What are the ramifications of setting the timeout to 15 or 20 seconds? That might make shutdown take too long if we happen to issue a shutdown request during a long-running stackwalker.
Whiteboard: [config]
Attached is a list of crash_ids from staging for which the stackwalker was killed for taking too long. The kill threshold was 10s. These uuids should be valid on both prod and staging. Looking through them informally, I'm not spotting any common traits.
Attachment #8376284 - Attachment is obsolete: true
Attachment #8378357 - Flags: feedback?(ted)
btw, the 166 killed stackwalkers in the attachment are in contrast with 314061 total invocations of the stackwalker. That's about 0.05% of all crashes. Is that enough to be concerned about, or is it just noise?
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/0e69aca0138a05b794133d1684877011176c3d63
Merge pull request #1901 from twobraids/log-kill-by-timeout

Fixes Bug 972857 - added timeout to stackwalker invocation
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: 75 → 78
Comment on attachment 8378357 [details]
a list of crash_ids for which stackwalker took 10s or longer

The only thing that jumps out at me after looking at the first ~10 of these is that they're mostly Android crashes, with a few Windows Beta crashes thrown in. I wonder if these are mostly just "crashes from non-Windows-Release-Firefox", and what you're actually seeing here is "crashes from less-popular products don't have their symbol files cached in memory so they take longer to read the files off of NFS".

Can you query for crashes where we actually had to kill the stackwalker? I'd be interested to see if it's the same pattern but worse, or if it's something else there.

I'll also note that I found the stackwalker getting killed in my Vagrant setup pretty frequently, and this PR fixed that:
https://github.com/mozilla/socorro/pull/2153

So it's possible that might have a positive impact here.
Attachment #8378357 - Flags: feedback?(ted)
Oh wait, those actually *did* get killed, I missed where you lowered the threshold. I'd be interested to see if the percentage changed since that PR went live. Also, did any of those get re-processed after failing? All of the ones I looked at seem to have processed successfully.