Closed Bug 1450708 Opened 6 years ago Closed 6 years ago

Crash in RtlRaiseStatus | RtlpUnWaitCriticalSection | RtlLeaveCriticalSection | RunnableFunction<T>::Run

Categories

(Core Graveyard :: Plug-ins, defect, P1)

Unspecified
Windows 10
defect

Tracking

(firefox-esr52 unaffected, firefox-esr6060+ fixed, firefox59 unaffected, firefox60+ fixed, firefox61+ verified)

VERIFIED FIXED
mozilla61
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 60+ fixed
firefox59 --- unaffected
firefox60 + fixed
firefox61 + verified

People

(Reporter: marcia, Assigned: handyman)

References

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is
report bp-dec6a53c-37b2-40e2-9c2a-b84420180210.
=============================================================

Windows 10 crash which appears to have been introduced when 60 was in nightly. Crashes seem to go as far back as Build ID 20180208103049. (81.82% in signature vs 00.81% overall) plugin_filename = NPSWF64_29_0_0_113.dll. Crash reason for all crashes is EXCEPTION_INVALID_HANDLE

Possible regression range based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=4fe6f6560083f8c8257282bef1d4e0ced9d1b975&tochange=c5120bcaf7bdcb5cdb06a02b60bd5bfe6a867d06

Comments:
* I was watching some youtube videos then switched over to an armorgames. I have a total of 2 separate pages open. A youtube home page and this armorgames page. My system is up to date with all the latest windows fixes and the latest nvidia drivers. 
* crash forge of empire 
* was playing with the Scratch Project Editor when it crashed. 

Top 10 frames of crashing thread:

0 ntdll.dll RtlRaiseStatus 
1 ntdll.dll RtlpUnWaitCriticalSection 
2 ntdll.dll RtlLeaveCriticalSection 
3 xul.dll RunnableFunction<void  ipc/chromium/src/base/task.h:350
4 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1040
5 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:517
6 xul.dll mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:364
7 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:319
8 xul.dll MessageLoop::Run ipc/chromium/src/base/message_loop.cc:299
9 xul.dll nsThread::ThreadFunc xpcom/threads/nsThread.cpp:423

=============================================================
Hey David, looks like this might be a regression from bug 1436253.
Flags: needinfo?(davidp99)
I think I see whats going on here.

1. Thread A posts to the dispatch thread D [1].  It passes a locked monitor M that it holds on its stack.
2. Thread A releases and waits on M.
3. D locks M [2] and notifies A.  Notify == Windows' WakeConditionVariable.  WakeConditionVariable has very little by way of documentation but it must be (in the rare cases where this crash happens) immediately waking A and continuing it.  Note that this is weird because it has to give A ownership of the mutex, but D also holds the mutex.  That's just how wait/notify condition variable operations work.  I assume, in this case, it wouldn't return control to D until the mutex is released by A, but A destroys the mutex (it was on A's stack).
4. D finally regains control and goes to release and free the destroyed mutex.  Blammo.

The stack makes no sense until you realize that the LeaveCriticalSection comes from the MonitorAutoLock destructor's call, which is tail-call optimized.

Now I just need to figure out what to do about it.

------

[1] https://searchfox.org/mozilla-central/rev/a0665934fa05158a5a943d4c8b277465910c029c/dom/plugins/ipc/FunctionBroker.h#1462
[2] https://searchfox.org/mozilla-central/rev/b9a5abc6cb161e0bc0bf0de5314f26d06f37d64b/dom/plugins/ipc/FunctionBroker.h#1288
Assignee: nobody → davidp99
Flags: needinfo?(davidp99)
Priority: -- → P1
jimm's the only person who has much exposure to this code and he's on PTO so you're getting pegged.  (I doubt the FunctionBroker aspect will be a problem for you to review but it can wait if needed.  Its really just a normal concurrency bug.)

I think what I said in comment 2 about the cause of this crash must be right but I don't want to continue to play with edge cases of Win32 condition variables so I changed the monitor to be ref-counted.  This not only makes the UAF crash impossible but, as a side-effect, it negates the need to hold the mutex during the call to Notify to handle spurious waking (since the only concern there was that the original thread would destroy the monitor -- which it did anyway, causing this bug -- and which is impossible now that its ref-counted).

Why don't we see this elsewhere?  From what I can tell, in other places in code, our monitors are member variables and globals -- objects whose lifetime is not tied to the stack.  Therefore, its much easier for them to coordinate shared monitor access.  My solution in this patch is an attempt to get that same easy reasoning here.

-----

Tests: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e101728349af4f2f4ab76a90bbd78d88024079a9
Attachment #8970382 - Flags: review?(bobowencode)
Comment on attachment 8970382 [details] [diff] [review]
Ref-count the plugin FunctionBroker mutex

Review of attachment 8970382 [details] [diff] [review]:
-----------------------------------------------------------------

My knowledge of threading issues isn't great, but this looks like it'll work and I don't see any obvious pitfalls.
Attachment #8970382 - Flags: review?(bobowencode) → review+
Keywords: checkin-needed
Pushed by csabou@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1e02fd7fa20c
Ref-count the plugin FunctionBroker mutex. r=bobowen
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/1e02fd7fa20c
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
this is a new crash in firefox 60. how confident would you feel in uplifting the patch to beta (it's rather late in the cycle now)?
Flags: needinfo?(davidp99)
This is not super easy to answer.  Mostly, I'd say the patch is at least as safe as what we have so we should go for it _but_ this has proven to be a delicate part of code and I could be missing something.  If we could hold until Monday (4/30) for uplift to see if anything explodes then I'd be completely behind it.  If we can't wait that long then I'd still say yes but I'd be sweating.

Its also not 100% that this will fix the issue.  Waiting would clarify that, too.

Will it present problems if we hold this to Monday?
Flags: needinfo?(davidp99)
No crashes on 61 since this landed. Calling this verified.
Status: RESOLVED → VERIFIED
Comment on attachment 8970382 [details] [diff] [review]
Ref-count the plugin FunctionBroker mutex

Looks good to me too.

Approval Request Comment
[Feature/Bug causing the regression]:
bug 1436253, which itself was a bug fix for the plugin FunctionBroker introduced in bug 1382251

[User impact if declined]:
Intermittent plugin crashes due to bad luck with concurrency

[Is this code covered by automated tests?]:
No

[Has the fix been verified in Nightly?]:
yes

[Needs manual test from QE? If yes, steps to reproduce]: 
No

[List of other uplifts needed for the feature/fix]:
None.  Depends on bug 1436253 but that went into 60.0a1

[Is the change risky?]:
Not more than not accepting the change would be.

[Why is the change risky/not risky?]:
The existing code crashes the plugin process.  The worst this code could do is the same but behavior on nightly suggests otherwise.  The code in this patch eliminates the possibility of the race condition that led to many of the crashes.

[String changes made/needed]:
None
Attachment #8970382 - Flags: approval-mozilla-beta?
Comment on attachment 8970382 [details] [diff] [review]
Ref-count the plugin FunctionBroker mutex

fix plugin related crashes, regression in 60, approved for 60.0 rc2
Attachment #8970382 - Flags: approval-mozilla-release+
Attachment #8970382 - Flags: approval-mozilla-esr60+
Attachment #8970382 - Flags: approval-mozilla-beta?
Flags: qe-verify-
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.