Closed Bug 1024765 Opened 9 years ago Closed 9 years ago

Refcounting of nsTimerImpl looks wrong in in TimerThread when PostTimerEvent fails

Categories

(Core :: XPCOM, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla33
Tracking Status
firefox30 --- wontfix
firefox31 + fixed
firefox32 + fixed
firefox33 + fixed
firefox-esr24 31+ fixed
b2g-v1.3 --- fixed
b2g-v1.3T --- fixed
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: bwc, Assigned: bwc)

References

Details

(Keywords: csectype-race, csectype-uaf, sec-high, Whiteboard: [adv-main31+][adv-esr24.7+][qa-])

Attachments

(4 files, 6 obsolete files)

8.25 KB, patch
bwc
: review+
abillings
: sec-approval+
Details | Diff | Splinter Review
2.82 KB, patch
bzbarsky
: review+
Details | Diff | Splinter Review
2.83 KB, patch
Details | Diff | Splinter Review
8.38 KB, patch
Details | Diff | Splinter Review
Noticed this while working on bug 941751. In TimerThread, we call nsTimerImpl::PostTimerEvent (just after doing an addref/release while removing the timer from mTimers), which does the following:

http://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsTimerImpl.cpp#722

The constructor of nsTimerEvent does not addref (because we are essentially transferring a reference from mTimers to the nsTimerEvent):

http://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsTimerImpl.cpp#116

It does set itself up to perform a release once it is destroyed, which makes sense:

http://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsTimerImpl.cpp#157

If we fail for any reason (other than OOM), the nsTimerEvent is destroyed prior to the return, and the refcount is decremented. Then, we decrement the refcount _again_ here, leading to a very misleading logging statement for yours truly, and probably a dangling nsITimer pointer somewhere:

http://dxr.mozilla.org/mozilla-central/source/xpcom/threads/TimerThread.cpp#266
Attached patch Patch that demonstrates the bug. (obsolete) — Splinter Review
Assignee: nobody → docfaraday
Status: NEW → ASSIGNED
Assignee: docfaraday → nobody
Debug builds are probably safe here most of the time (since we'll usually hit the assertion), but if that assertion doesn't fire I'm not sure what will happen.
Status: ASSIGNED → NEW
Byron, did you plan to fix this?
I can take a crack at it I suppose.
Ok, I see a few ways that might make sense to do this:

1) Have PostTimerEvent do a brute-force forget().take() on the timer to relinquish the reference when an error is about to be returned.

2) Refactor to have PostTimerEvent() take an already_AddRefed<nsTimerImpl>& inout param that the function may or may not take (if not, the caller will still hold the reference).

3) Make nsresult an out param and return an already_AddRefed containing the timer in the event of failure.


I kinda like option 2.
Option 1. Seems like the least amount of code and the smallest change, but it could be argued that some refactoring should be done here.
Assignee: nobody → docfaraday
Status: NEW → ASSIGNED
Comment on attachment 8440133 [details] [diff] [review]
Part 2: Do not take a reference in PostTimerEvent in the case of failure.

Review of attachment 8440133 [details] [diff] [review]:
-----------------------------------------------------------------

This seems like the most minimal fix, but would you prefer that I rewrite the code in TimerThread to stop using direct refcount manipulation?
Attachment #8440133 - Flags: feedback?(bzbarsky)
Comment on attachment 8440133 [details] [diff] [review]
Part 2: Do not take a reference in PostTimerEvent in the case of failure.

This looks fine, but I'd just make ForgetTimer() return void and move the take() into it.  And maybe in the comments indicate that it's the destructor of nsTimerEvent that we don't want to decrement the refcount?
Attachment #8440133 - Flags: feedback?(bzbarsky) → feedback+
(In reply to Boris Zbarsky [:bz] from comment #8)
> Comment on attachment 8440133 [details] [diff] [review]
> Part 2: Do not take a reference in PostTimerEvent in the case of failure.
> 
> This looks fine, but I'd just make ForgetTimer() return void and move the
> take() into it.  And maybe in the comments indicate that it's the destructor
> of nsTimerEvent that we don't want to decrement the refcount?

The reason I had it return already_AddRefed is to make it easier to implement the more explicit options 2 or 3, should we decide to go that route (now, or someday later). Also, the scary (void)event->ForgetTimer().take() makes it more obvious that some funky reference counting stuff is going on, and that the code in |PostTimerEvent| needs to be treated carefully. Just calling ForgetTimer() might not jump out in the same way.

I should point out that I am quite willing to implement option 2 or 3, but I'm not all that familiar with the conventions that are used when implementing an optional reference transfer, so I don't know which fits the rest of the code better (or if there's another variant I haven't thought of). I worry that option 1 leaves |PostTimerEvent| fragile, and also paints a pretty obvious bullseye on the problem, whereas options 2 and 3 leave us in a better place going forward, and just look like code cleanup.
Ah, ok.  In that case, I think the best solution here is actually a combination of 2 and 3: an already_AddRefed in param and an already_AddRefed return value, with the return value null on success, non-null on failure.  The caller ignores the actual return value anyway, so there's no point propagating out a particular nsresult.
Alright, I'll give that a go. Thanks!
An alternative approach.
Restore some ordering.
Attachment #8440784 - Attachment is obsolete: true
Remove an outdated comment.
Attachment #8440819 - Attachment is obsolete: true
Comment on attachment 8440835 [details] [diff] [review]
Part 2: Make refcounting logic around PostTimerEvent more explicit.

Review of attachment 8440835 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/threads/TimerThread.cpp
@@ +247,2 @@
>            RemoveTimerInternal(timer);
> +          timer = nullptr;

Setting this here to avoid use in the code below; should be using timerRef instead.

@@ +267,5 @@
> +            if (timerRef) {
> +              // We got our reference back due to an error.
> +              // Unhook the nsRefPtr, and release manually so we can get the
> +              // refcount.
> +              nsrefcnt rc = timerRef.forget().take()->Release();

Is there a simpler way to do this?

::: xpcom/threads/nsTimerImpl.cpp
@@ +732,2 @@
>  
> +  nsRefPtr<nsTimerEvent> event = new nsTimerEvent;

If we wanted to pass the timer as an already_AddRefed here, we'd need to keep an extra reference around so we can do the stuff below. I can do that if it is preferable, just let me know.
Attachment #8440835 - Flags: review?(bzbarsky)
CC'ing maire and jesup because this bug is related to bug 941751 (one of our oranges).
Also CC'ing Nils and Shell by request.
Comment on attachment 8440835 [details] [diff] [review]
Part 2: Make refcounting logic around PostTimerEvent more explicit.

> Is there a simpler way to do this?

No.  :(

>+nsTimerImpl::PostTimerEvent(already_AddRefed<nsTimerImpl> timerRef)

aTimerRef?

>+  nsRefPtr<nsTimerEvent> event = new nsTimerEvent;

This can't return null.  Does that make it simpler to just go ahead and pass aTimerRef to the constructor here?  We could then use event->Timer() (which would return nsTimerImpl*) for whatever we need the timer for, and event->ForgetTimer() for error returns.  Then we don't need a SetTimer().

r=me
Attachment #8440835 - Flags: review?(bzbarsky) → review+
(In reply to Boris Zbarsky [:bz] from comment #18)
> Comment on attachment 8440835 [details] [diff] [review]
> Part 2: Make refcounting logic around PostTimerEvent more explicit.
> 
> >+  nsRefPtr<nsTimerEvent> event = new nsTimerEvent;
> 
> This can't return null.  Does that make it simpler to just go ahead and pass
> aTimerRef to the constructor here?  We could then use event->Timer() (which
> would return nsTimerImpl*) for whatever we need the timer for, and
> event->ForgetTimer() for error returns.  Then we don't need a SetTimer().
> 

   Actually, we override operator new to alloc from a memory pool for this, so it might not be infallible:

http://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsTimerImpl.cpp?from=nsTimerEvent#138

   That in itself probably merits a comment.
Hopefully this fixes our intermittent Bug 941751
Blocks: 941751
Oh, ouch.  Yes, that can definitely use a comment!
Attachment #8439584 - Attachment is obsolete: true
Incorporate feedback, and suppress an unused variable warning.
Attachment #8440835 - Attachment is obsolete: true
Comment on attachment 8440939 [details] [diff] [review]
Part 1: Add test-case for timers firing when the target thread is not running.

Review of attachment 8440939 [details] [diff] [review]:
-----------------------------------------------------------------

Might be a good idea to land this too?
Attachment #8440939 - Flags: feedback?(bzbarsky)
Attachment #8440133 - Attachment is obsolete: true
Comment on attachment 8440939 [details] [diff] [review]
Part 1: Add test-case for timers firing when the target thread is not running.

Yes, please!
Attachment #8440939 - Flags: feedback?(bzbarsky) → feedback+
Comment on attachment 8440941 [details] [diff] [review]
Part 2: Make refcounting logic around PostTimerEvent more explicit.

Review of attachment 8440941 [details] [diff] [review]:
-----------------------------------------------------------------

Carry forward r=bz
Attachment #8440941 - Flags: review+
Seems to be working, will give the test-case another scrub tomorrow morning since it was just a hacked-up copy/paste job.

https://tbpl.mozilla.org/?tree=Try&rev=677d6e237448
It should be an assertable problem for a timer to stay alive when the target thread is not running. I think we should assert that there, but also consider whether we can assert it at nsThread::Shutdown also.
I don't know about that. Cancel() does not prevent TimerThread from trying to dispatch; it only prevents the timer from popping once it hits the target thread. So, if Cancel() races the TimerThread, it is very easy for a dispatch to the target thread to fail.
Why can't it prevent the dispatch? In particular I expect the following sequence should not attempt to dispatch to a dead thread:

mTimer->Cancel();
mThread->Shutdown();

This pattern should really assert if we can make it:

mThread->Shutdown(); // ASSERT: Active timer pointing at dying thread
mTimer->Cancel();
Sounds bad, so I'm going to mark this sec-high.  Please get sec-approval before landing.
Looking at the code again, it does look like Cancel() should be removing the timer from TimerThread's set of timers, so I'll need to look into why we don't learn that our thread is shutting down until after dispatches are being denied.
Ok, there is still a race here. Let's say the TimerThread is cruising along, and decides to fire a timer. We end up going through the motions, and end up here:

http://dxr.mozilla.org/mozilla-central/source/xpcom/threads/TimerThread.cpp#251

We have removed the timer from mTimers at this point, and we unlock. Then, on the target thread, we call Cancel(). This flips the mCancelled bit, and tries to remove from mTimers again (with no effect). Since the monitor is unlocked (see above), we return straightaway:

http://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsTimerImpl.cpp#434

Then, the target thread shuts down, and TimerThread continues execution where it left off, and attempts to post the timer.
I should clarify; the race above is pretty much benign once the above patches are in place, since we recover from the failed dispatch. But if we add an assert, we're going to hit it occasionally even when the user of the timer is doing the right thing.
Attachment #8441653 - Flags: review?(bzbarsky)
Attachment #8440939 - Attachment is obsolete: true
Comment on attachment 8441653 [details] [diff] [review]
Part 1: Add test-case for timers firing when the target thread is not running.

r=me.  Does this test actually fail without the part 2 patch?
Attachment #8441653 - Flags: review?(bzbarsky) → review+
(In reply to Boris Zbarsky [:bz] from comment #36)
> Comment on attachment 8441653 [details] [diff] [review]
> Part 1: Add test-case for timers firing when the target thread is not
> running.
> 
> r=me.  Does this test actually fail without the part 2 patch?

Yes, we hit the rc != 0 assertion. I could do a non-debug ASAN build and verify that we double free, but that would take quite a bit of time.

Needinfo self to check back on try push.

https://tbpl.mozilla.org/?tree=Try&rev=638b6db666c1
Flags: needinfo?(docfaraday)
Flags: needinfo?(docfaraday) → sec-review?
Oh now this is interesting; the Marionette tests are very sad on windows with these patches.
Trying again with a more recent pull:

https://tbpl.mozilla.org/?tree=Try&rev=5d297691df6a
Comment on attachment 8440941 [details] [diff] [review]
Part 2: Make refcounting logic around PostTimerEvent more explicit.

[Security approval request comment]
How easily could an exploit be constructed based on the patch?

   For code that calls Cancel() on its timers before thread shutdown, it is an extremely tight race to hit this bug. If there was something content could do to cause very short lifetime timers to be scheduled, and then cause thread shutdown, it might be able to trigger the race in comment 33 with some frequency, but it is hard to say (this is my current hypothesis as to the cause of bug 941751). For code that does not call Cancel() when it is supposed to, it becomes much easier, but would require content to cause the target thread in question to shut down in some way (possibly by inducing the user to close the browser, for example). However, it seems doubtful that we have code that does this, or else we'd see this assertion being hit very frequently. 

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?

   The test-case might provide a hint, but the bugfix looks like code cleanup.

Which older supported branches are affected by this flaw?

   Probably all; the patch on bug 744836 introduced this.

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?

   I do not have backports, but given how little the code has changed, I do not think they will be particularly hard to construct.

How likely is this patch to cause regressions; how much testing does it need?

   Well, this patch is a change in refcounting logic, on a widely-used feature. I think I've gotten it right, but there is some risk. At least we're making use of the safer nsRefPtr and already_AddRefed (the only departure from this is where we check the refcount when PostTimerEvent fails, but we know for sure the existing code would do something nasty here).
Attachment #8440941 - Flags: sec-approval?
Flags: sec-review?
Comment on attachment 8440941 [details] [diff] [review]
Part 2: Make refcounting logic around PostTimerEvent more explicit.

sec-approval+ for trunk.

We should get Aurora, Beta, and ESR24 patches for this.
Attachment #8440941 - Flags: sec-approval? → sec-approval+
I'll go ahead and cook those up.
Patches seem to apply cleanly to aurora. Verifying that they build and then running some tests.
Patches apply cleanly to beta. Checking build-tests there now.
Comment on attachment 8441653 [details] [diff] [review]
Part 1: Add test-case for timers firing when the target thread is not running.

aurora and beta seem to do what I'd expect with the patches as they are. Not sure if we want to land the test case too, but here's the request.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 

   bug 744836

User impact if declined:

   No unit test in CI for the bugfix.

Testing completed (on m-c, etc.):

   This test is being run in CI.
   
Risk to taking this patch (and alternatives if risky): 

   None apart from test failures.

String or IDL/UUID changes made by this patch:

   None.
Attachment #8441653 - Flags: approval-mozilla-beta?
Attachment #8441653 - Flags: approval-mozilla-aurora?
Comment on attachment 8440941 [details] [diff] [review]
Part 2: Make refcounting logic around PostTimerEvent more explicit.

aurora and beta seem to work fine with the patches as-is

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 

   bug 744836

User impact if declined: 

   Rare UAF when shutting down a thread that uses timers.

Testing completed (on m-c, etc.):

   CI testing on try, plus a new test-case in TestTimers.

Risk to taking this patch (and alternatives if risky):

   There is some risk here, since we're messing with refcounts on a widely-used feature. I suppose an alternative would be to abort() if the dispatch fails, so we have a controlled crash.

String or IDL/UUID changes made by this patch:

   None.
Attachment #8440941 - Flags: approval-mozilla-beta?
Attachment #8440941 - Flags: approval-mozilla-aurora?
Keywords: checkin-needed
Comment on attachment 8443829 [details] [diff] [review]
Part 1: (patch for esr24) Add test-case for timers firing when the target thread is not running.

[Approval Request Comment]

See separate comment for aurora/beta approval.
Attachment #8443829 - Flags: approval-mozilla-esr24?
Comment on attachment 8443830 [details] [diff] [review]
Part 2: (patch for esr24) Make refcounting logic around PostTimerEvent more explicit.

[Approval Request Comment]

See comment for aurora/beta approval.
Attachment #8443830 - Flags: approval-mozilla-esr24?
Attachment #8443830 - Flags: approval-mozilla-esr24? → approval-mozilla-esr24+
Attachment #8443829 - Flags: approval-mozilla-esr24? → approval-mozilla-esr24+
Comment on attachment 8441653 [details] [diff] [review]
Part 1: Add test-case for timers firing when the target thread is not running.

Taking the 2 patches for aurora. If no issues are found, I will accept them for beta 5 (end of the week)
Attachment #8441653 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8440941 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8440941 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8441653 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Blocks: 1036531
Whiteboard: [adv-main31+][adv-esr24.7+]
Marking [qa-] for verification purposes, based on comments and lack of testable media / STR. Please feel free to provide if you'd like help in testing, thank you.
Whiteboard: [adv-main31+][adv-esr24.7+] → [adv-main31+][adv-esr24.7+][qa-]
Alias: CVE-2014-1553
Alias: CVE-2014-1553
Group: core-security → core-security-release
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.