Closed Bug 1033121 Opened 10 years ago Closed 10 years ago

Race in nsTimerEvent destructor

Categories

(Core :: XPCOM, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla33
blocking-b2g 1.4+
Tracking Status
firefox31 --- wontfix
firefox32 --- fixed
firefox33 --- fixed
firefox-esr24 --- wontfix
firefox-esr31 --- fixed
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: jwwang, Assigned: jwwang)

References

Details

(Keywords: csectype-race, sec-moderate, Whiteboard: [adv-main32+][adv-esr31.1+])

Attachments

(1 file)

I found this problem when debugging timeouts in test_playback_rate.html. [1]

It looks like it is possible for |event| [2] to be executed in the target thread before nsTimerImpl::PostTimerEvent returns. Then nsTimerEvent will be deleted on exit of nsTimerImpl::PostTimerEvent in the timer thread. In the destructor of nsTimerEvent, it will call nsTimerImpl::Release in the timer thread. At the same time, it is also possible for nsTimerImpl::InitWithCallback to be called in the target thread. Therefore we might hit bug 997844 comment 0 again.


[1] bug 1020707 comment 42
[2] http://hg.mozilla.org/mozilla-central/file/3a28d3bb4f55/xpcom/threads/nsTimerImpl.cpp#l763
Hi Boris,
Can you check if this is a problem of nsTimer?
Flags: needinfo?(bzbarsky)
Group: core-security
Dupe of bug 1024765 maybe?
Doesn't seem so. Does Dispatch not take the reference of the passed Runnable when it succeeds? If not, then yeah we could end up back in the scenario described in bug 997844. Ideally, the behavior I would want from Dispatch is an optional reference transfer (ie; if it works, the reference is transferred, otherwise it is not), but this may not be feasible. If it isn't, I think we'd have to refactor so that we did not need to unlock the monitor here:

http://dxr.mozilla.org/mozilla-central/source/xpcom/threads/TimerThread.cpp#253

This would prevent the target thread from getting to AddTimerInternal while all this is going on, which would also fix the race in bug 1024765 comment 33 I think.
The signature is Dispatch(nsIRunnable *event, uint32_t flags). The event target can only add reference but not take reference on success.
> It looks like it is possible for |event| [2] to be executed in the target thread before
> nsTimerImpl::PostTimerEvent returns.

Yes, this is always possible for event-dispatching situations.  In particular, |event| can execute before Dispatch() has returned...

You're right that this is a problem.

I think the right thing to do here is to null out mTimer in nsTimerEvent::Run() before any of its return statements.  Agreed?
Flags: needinfo?(bzbarsky) → needinfo?(jwwang)
That seems like it would work.
yeah.
Flags: needinfo?(jwwang)
Blocks: 1020707
Is there anyone gonna take and work on this issue?
Please feel free to write the patch and request review?
OS: Android → All
Hardware: ARM → All
Release |mTimer| in nsTimerEvent::Run() instead of ~nsTimerEvent() to avoid race.
Assignee: nobody → jwwang
Status: NEW → ASSIGNED
Attachment #8451627 - Flags: review?(bzbarsky)
Comment on attachment 8451627 [details] [diff] [review]
1033121_fix_race_in_nsTimerEvent_destructor-v1.patch

r=me.  I can't think of as sane way to add a testcase for this.  :(
Attachment #8451627 - Flags: review?(bzbarsky) → review+
How about adding assertions to check most nsITimer functions should be called in the target thread.
try: https://tbpl.mozilla.org/?tree=Try&rev=10b52bb4f343
most green.
Keywords: checkin-needed
> How about adding assertions

That might work, yes.  Want to give it a shot on try without this patch?
Does this need to go through sec-approval?
Flags: needinfo?(continuation)
Keywords: checkin-needed
Is it exploitable and rated sec-high or sec-crit?  Does it affect more than trunk?  If yes to both, then it does.

It seems kind of sec-high-ish to me, because it sounds like it could lead to a use after free or something.
Flags: needinfo?(continuation)
Keywords: csectype-race
(I was kind of hoping you would rate it ...)
I don't know what is actually going wrong, and I don't understand comment 0 here or in bug 997844.
So, I don't think this could lead to a UAF. The fundamental problem here is that an nsTimerImpl could fail to be scheduled if the user gets really unlucky. This could cascade and cause other problems, but likely not a UAF (more likely stuff like leaks and hangs, and asserts in debug builds probably). Coupled with the fact that this is an extremely tight race, it is probably somewhat difficult to trigger in an attack.

@jwwang Were you seeing dangerous-looking crashes (ie; anything other than an assert) with this?
Flags: needinfo?(jwwang)
It seems theoretically possible to cause use-after-free or memory corruption if [1] is run in the timer thread after [2] is run in the target thread when 2 threads are calling nsTimerImpl::Release() at the same time.
[1] http://hg.mozilla.org/mozilla-central/file/7f9db2379b3f/xpcom/threads/nsTimerImpl.cpp#l264
[2] http://hg.mozilla.org/mozilla-central/file/7f9db2379b3f/xpcom/threads/nsTimerImpl.cpp#l230

Btw, what is UAF?
Flags: needinfo?(jwwang) → needinfo?(bzbarsky)
UAF = Use After Free
(In reply to JW Wang [:jwwang] from comment #20)
> It seems theoretically possible to cause use-after-free or memory corruption
> if [1] is run in the timer thread after [2] is run in the target thread when
> 2 threads are calling nsTimerImpl::Release() at the same time.

Ah, that's lovely. We've managed to take a threadsafe refcount, and make it non-threadsafe by overriding Release to do a whole bunch of extra stuff.
Flags: needinfo?(bzbarsky)
Btw, there is also a race in nsTimerImpl::SetDelayInternal() which is called by nsTimerImpl::PostTimerEvent which is called in the timer thread. We should really think about making nsTimerImpl thread-safe. Lockless code is just hard to get right.
(In reply to JW Wang [:jwwang] from comment #23)
> Btw, there is also a race in nsTimerImpl::SetDelayInternal() which is called
> by nsTimerImpl::PostTimerEvent which is called in the timer thread. We
> should really think about making nsTimerImpl thread-safe. Lockless code is
> just hard to get right.

Yes, I believe we do have another race here, because we have no assurance that |Fire| will not be called on the target thread at the same time as |PostTimerEvent| is called on the timer thread, either because we're using repeating precise timers, or because something racy has happened with |mGeneration| (yes, there is a race here too, because it is written on the target thread and read on the timer thread with no sync barrier).

IMO, What we really need to do is greatly expand unit-test for this, and run it through its paces with TSAN.
Blocks: 1036531
Shall we take this patch while re-design is in progress?
Yes, absolutely!
So the way forward is to request sec-approval, write comments about that, and then put "check-in-needed" again?
sec-moderates don't need approval.
https://wiki.mozilla.org/Security/Bug_Approval_Process
Keywords: checkin-needed
https://hg.mozilla.org/integration/mozilla-inbound/rev/2f00e114a489

Probably worth nominating this for at least Aurora uplift. Not sure about beta/b2g30 given where we are in the release cycle. I can always disable my way to victory on any media tests being bitten by this if need-be.
https://hg.mozilla.org/mozilla-central/rev/2f00e114a489
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
Blocks: 668973
Blocks: 886188
Comment on attachment 8451627 [details] [diff] [review]
1033121_fix_race_in_nsTimerEvent_destructor-v1.patch

Approval Request Comment
[Feature/regressing bug #]:unknown
[User impact if declined]:timer callback will cease to fire which will impact the overall system functionality.
[Describe test coverage new/current, TBPL]:
aurora - https://tbpl.mozilla.org/?tree=Try&rev=cb7a6367e147
beta - https://tbpl.mozilla.org/?tree=Try&rev=2e704d9d39d6
It already runs fine on inbound/central for a week.
[Risks and why]: low for the change is simple and root cause is well known
[String/UUID change made/needed]:none
Attachment #8451627 - Flags: approval-mozilla-beta?
Attachment #8451627 - Flags: approval-mozilla-aurora?
Comment on attachment 8451627 [details] [diff] [review]
1033121_fix_race_in_nsTimerEvent_destructor-v1.patch

Too late for 31.

Not sure it is critical enough for ESR 31.
Attachment #8451627 - Flags: approval-mozilla-esr31?
Attachment #8451627 - Flags: approval-mozilla-beta?
Attachment #8451627 - Flags: approval-mozilla-beta-
Attachment #8451627 - Flags: approval-mozilla-aurora?
Attachment #8451627 - Flags: approval-mozilla-aurora+
I don't think so. For now it seems only Android will hit this bug.
https://hg.mozilla.org/releases/mozilla-aurora/rev/86f331357783

Not sure if this is something that could affect real B2G devices in the wild or not.
blocking-b2g: --- → 1.4?
Comment on attachment 8451627 [details] [diff] [review]
1033121_fix_race_in_nsTimerEvent_destructor-v1.patch

[Approval Request Comment]
Bug caused by (feature/regressing bug #): unknown
User impact if declined: timer callback will cease to fire which will impact the overall system functionality.
Testing completed: 
It has run fine on inbound/central for about 2 weeks.
try: https://tbpl.mozilla.org/?tree=Try&rev=95651f209cc9
Risk to taking this patch (and alternatives if risky): low for the change is simple and root cause is well known
String or UUID changes made by this patch:none
Attachment #8451627 - Flags: approval-mozilla-b2g30?
Add more Mozilla person
Let's land this in v1.4.

Hi JW,

Do we need to rebase the patch for v1.4. Thanks?
blocking-b2g: 1.4? → 1.4+
Flags: needinfo?(jwwang)
No, I could apply it cleanly on b2g30.
Flags: needinfo?(jwwang)
Attachment #8451627 - Flags: approval-mozilla-b2g30?
https://hg.mozilla.org/releases/mozilla-b2g30_v1_4/rev/3cf46534c321
Comment on attachment 8451627 [details] [diff] [review]
1033121_fix_race_in_nsTimerEvent_destructor-v1.patch

Taking it because ESR31 is also used for Android 2.2 and ARMv6 31 release...
Attachment #8451627 - Flags: approval-mozilla-esr31? → approval-mozilla-esr31+
https://hg.mozilla.org/releases/mozilla-esr31/rev/3d496623c725
Marking qe-verify- for verification purposes due to lack of test case or easier STR. Please feel free to provide if you'd like me to verify the fix. Thank you.
Flags: qe-verify-
Whiteboard: [adv-main32+][adv-esr31.1+]
Group: core-security → core-security-release
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: