Closed Bug 1252382 Opened 4 years ago Closed 4 years ago

Infinite recursion in RunWatchdog() trying to call PR_Sleep after NSPR is shut down

Categories

(NSPR :: NSPR, defect, critical)

defect
Not set
critical

Tracking

(firefox47 fixed)

RESOLVED FIXED
Tracking Status
firefox47 --- fixed

People

(Reporter: jesup, Assigned: Yoric)

Details

(Keywords: crash)

Attachments

(2 files)

RunWatchdog call PR_Sleep for a second at a time.  It never exits.

During shutdown, it's possible for it to call PR_Sleep when NSPR is in an unusable/shut-down state due to bad luck/timing.  This means that when PR_Sleep calls PR_GetCurrentThread(), it tries to run pt_AttachThread(), which fails the pthread_setspecific() call, causing a PR_ASSERT() -- which calls PR_GetCurrentThread(), lather, rinse, repeat.  20000 stack frames later it crashes with SEGV, causing exit code -11/-139 (-(11|128) - 128 means core dumped).

Things are far enough shut down that info threads in gdb shows "cannot find new threads: generic error".  And printing _pr_initialized causes an access error.

I can't say this is the cause of *all* -11 exit codes... but this would cause random failures  during shutdown.  The permafail I see trying to land my full_duplex patches makes sense in this context, and why a little shutdown sequence tweaking caused it to go from 100% fail to 10-20%.  I got lucky and an ASAN mochitest run of dom/media/tests/mochitest I was doing triggered it twice in a row (at the same place I see the permafail (shutting down at the end of the identity tests), though in this case I didn't have those patches applied!)


Solutions: probably are several.  As a kludge to test, I upped the "tick length" to 120 seconds (rounded up!)  We really need to ensure that it shuts down the RunWatchdog thread (and any others), or we ensure they won't run any more (locks/mutexes, ugh), or that it will safely fail-quiet if NSPR is late in shutdown.
Flags: needinfo?(dteller)
Can we sleep(3) instead?  Honestly I have no idea why PR_Sleep is a thing at all.
Ah, sleep isn't actually in the standard library ... still, Windows has an equivalent.  We could ditch NSPR here.
Sure. We could use sleep(3) and Sleep. I'll post a patch later today.
Flags: needinfo?(dteller)
The terminator currently makes use of PR_Sleep. As it turns out, late
in shutdown, this can cause infinite recurions and finally -11. This
patch replaces PR_Sleep with sleep(3) and Sleep().

Review commit: https://reviewboard.mozilla.org/r/37347/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/37347/
Attachment #8725131 - Flags: review?(nfroyd)
Assignee: nobody → dteller
Comment on attachment 8725131 [details]
MozReview Request: Bug 1252382 - Get rid of PR_Sleep in the terminator;r?froydnj

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/37347/diff/1-2/
Comment on attachment 8725131 [details]
MozReview Request: Bug 1252382 - Get rid of PR_Sleep in the terminator;r?froydnj

https://reviewboard.mozilla.org/r/37347/#review33941

r=me with the below changes.

::: toolkit/components/terminator/nsTerminator.cpp:37
(Diff revision 2)
> +#include "Windows.h"
> +#else
> +#include "unistd.h"

These should be included with brackets, not quotes, correct?

::: toolkit/components/terminator/nsTerminator.cpp:150
(Diff revision 2)
> +    sleep(1 /* sec */);

Let's use usleep here instead of sleep:

  usleep(1000000 /* usec */);
  
sleep is implemented with signals at least on Linux.  usleep is implemented with actual system calls on Linux and Mac.

In an ideal world, we would just use nanosleep and handle interruption, but one step at a time.
Attachment #8725131 - Flags: review?(nfroyd) → review+
Comment on attachment 8725131 [details]
MozReview Request: Bug 1252382 - Get rid of PR_Sleep in the terminator;r?froydnj

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/37347/diff/2-3/
https://hg.mozilla.org/mozilla-central/rev/e9a72b270abb
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.