Various rare crashes in nsTimer, nsTimerImpl, nsTimerEvent

RESOLVED FIXED in Firefox 54

Status

()

defect
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: bwc, Assigned: bwc)

Tracking

(Blocks 1 bug, 4 keywords)

48 Branch
mozilla55
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox-esr45 unaffected, firefox52 wontfix, firefox-esr52 wontfix, firefox53 wontfix, firefox54+ fixed, firefox55+ fixed, firefox56+ fixed, firefox57+ fixed, firefox58 fixed)

Details

(Whiteboard: [post-critsmash-triage])

Attachments

(3 attachments, 3 obsolete attachments)

crash-stats has a smallish background rate of crashes that look like memory management problems (the races in nsTimerImpl itself seem to be fixed, though).

https://crash-stats.mozilla.com/search/?signature=~nsTimer&version=52.0b5&date=%3E%3D2017-02-07T21%3A13%3A50.000Z&date=%3C2017-02-14T21%3A13%3A50.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

In particular, the crashes in nsTimerEvent::Run are suspicious, and hint that something racy is being done to nsTimerEvent::mTimer (perhaps nsTimerEvent::Cancel is racing?).
Group: core-security → dom-core-security
I'm going to mark this sec-high, but maybe lower would be okay, too.
Keywords: sec-high
MozReview-Commit-ID: 6br6DaDqxR0
Attachment #8838611 - Flags: review?(nfroyd)
Assignee: nobody → docfaraday
Status: NEW → ASSIGNED
MozReview-Commit-ID: BXCGYWnFqSj
Attachment #8838613 - Flags: review?(nfroyd)
These changes might do the trick. Hard to say for sure.
Comment on attachment 8838611 [details] [diff] [review]
Part 1: Simplify nsTimerEvent::Cancel, since there's no need to release the nsTimerImpl itself

Review of attachment 8838611 [details] [diff] [review]:
-----------------------------------------------------------------

I don't think this patch is right, but I'm happy to hear counterarguments as to why I'm wrong.

::: xpcom/threads/TimerThread.cpp
@@ -284,5 @@
>  {
> -  if (!mTimer) {
> -    MOZ_ASSERT(false);
> -    return NS_OK;
> -  }

Why is this block safe to remove?  If the timer gets Cancel'd after the event is already in the event queue, we don't want Run() to do anything.

::: xpcom/threads/nsTimerImpl.cpp
@@ +128,5 @@
>      mImpl->Cancel();
>      mImpl = nullptr;
> +    // This might not be 0, technically speaking, but as far as the caller of
> +    // Release is concerned this object is dead.
> +    return 0;

I'm skeptical of this change, since we use the return value of this method for general invariant checking.
Attachment #8838611 - Flags: review?(nfroyd)
Comment on attachment 8838613 [details] [diff] [review]
Part 2: Help prevent nullptr crashes due to misuse of the timer API

Review of attachment 8838613 [details] [diff] [review]:
-----------------------------------------------------------------

Canceling review while we get this worked out.

::: xpcom/threads/nsTimerImpl.cpp
@@ +452,5 @@
>      oldDelay = mDelay;
>      oldTimeout = mTimeout;
> +    // Ensure that the nsITimer does not unhook from the nsTimerImpl during
> +    // Fire; this will cause null pointer crashes if the user of the timer drops
> +    // its reference, and then uses the nsITimer* passed in the callback.

The scenario here is that:

T1: Create timer.
T1: timer->SetTarget(T2)
....time passes
T2: nsTimeImpl::Fire is running.
T1: null out nsCOMPtr to timer (nsITimer).
T2: Pass nsITimer* from strong reference, which is bogus

Is that right?  TimerImpl is still holding a strong reference to nsITimer* thing, though...or does that get removed somehow?
Attachment #8838613 - Flags: review?(nfroyd)
Comment on attachment 8838611 [details] [diff] [review]
Part 1: Simplify nsTimerEvent::Cancel, since there's no need to release the nsTimerImpl itself

Review of attachment 8838611 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/threads/TimerThread.cpp
@@ -284,5 @@
>  {
> -  if (!mTimer) {
> -    MOZ_ASSERT(false);
> -    return NS_OK;
> -  }

If nsTimerEvent::Cancel() is called, we call nsTimerImpl::Cancel() which will prevent nsTimerEvent::Run() from doing anything besides logging. Are you saying to want to avoid the logging below if either the Runnable or the timer itself is cancelled? That could be done, I suppose.

::: xpcom/threads/nsTimerImpl.cpp
@@ +128,5 @@
>      mImpl->Cancel();
>      mImpl = nullptr;
> +    // This might not be 0, technically speaking, but as far as the caller of
> +    // Release is concerned this object is dead.
> +    return 0;

I worry that something will see a return of 1 from NS_RELEASE and assume the object isn't dead yet (which would be an invalid assumption in the general case). If you think I shouldn't worry about this, I can remove.
Flags: needinfo?(nfroyd)
(In reply to Byron Campen [:bwc] from comment #9)
> ::: xpcom/threads/TimerThread.cpp
> @@ -284,5 @@
> >  {
> > -  if (!mTimer) {
> > -    MOZ_ASSERT(false);
> > -    return NS_OK;
> > -  }
> 
> If nsTimerEvent::Cancel() is called, we call nsTimerImpl::Cancel() which
> will prevent nsTimerEvent::Run() from doing anything besides logging. Are
> you saying to want to avoid the logging below if either the Runnable or the
> timer itself is cancelled? That could be done, I suppose.

Oh, oh, I see.  Sorry I got mixed up here: nsTimerImpl::Fire will take care of all the details of making the timer not actually do things, not nsTimerEvent::Run.  OK, that part makes sense.

> ::: xpcom/threads/nsTimerImpl.cpp
> @@ +128,5 @@
> >      mImpl->Cancel();
> >      mImpl = nullptr;
> > +    // This might not be 0, technically speaking, but as far as the caller of
> > +    // Release is concerned this object is dead.
> > +    return 0;
> 
> I worry that something will see a return of 1 from NS_RELEASE and assume the
> object isn't dead yet (which would be an invalid assumption in the general
> case). If you think I shouldn't worry about this, I can remove.

I think you shouldn't worry about it; I'd rather make minimal changes to the timer code if possible, and breaking Release's contract doesn't seem like a good idea in any event.
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #8)
> Comment on attachment 8838613 [details] [diff] [review]
> Part 2: Help prevent nullptr crashes due to misuse of the timer API
> 
> Review of attachment 8838613 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Canceling review while we get this worked out.
> 
> ::: xpcom/threads/nsTimerImpl.cpp
> @@ +452,5 @@
> >      oldDelay = mDelay;
> >      oldTimeout = mTimeout;
> > +    // Ensure that the nsITimer does not unhook from the nsTimerImpl during
> > +    // Fire; this will cause null pointer crashes if the user of the timer drops
> > +    // its reference, and then uses the nsITimer* passed in the callback.
> 
> The scenario here is that:
> 
> T1: Create timer.
> T1: timer->SetTarget(T2)
> ....time passes
> T2: nsTimeImpl::Fire is running.
> T1: null out nsCOMPtr to timer (nsITimer).
> T2: Pass nsITimer* from strong reference, which is bogus
> 
> Is that right?  TimerImpl is still holding a strong reference to nsITimer*
> thing, though...or does that get removed somehow?

The scenario is:

T2: nsTimerImpl is running, during which the following happens:
   1. T2 drops its (only) reference to the nsITimer, causing the refcount to go to 1.
   2. nsTimer::Release sees the refcount of 1, then turns around and nulls out nsTimer::mTimerImpl.
   3. Then, still in the callback, T2 uses the nsITimer* param in the callback to grab a _new_ (non-functional!) reference to the nsITimer.

Now we have T2 holding onto an nsITimer with no implementation, leading to a null pointer crash when it is used. Adding another ref during fire prevents the refcount from going to 1, and this scenario from unfolding. We could do the same thing if we could pass an already_addRefed<nsITimer> or something to the callback.
(In reply to Nathan Froyd [:froydnj] from comment #10)
> (In reply to Byron Campen [:bwc] from comment #9)
> > ::: xpcom/threads/nsTimerImpl.cpp
> > @@ +128,5 @@
> > >      mImpl->Cancel();
> > >      mImpl = nullptr;
> > > +    // This might not be 0, technically speaking, but as far as the caller of
> > > +    // Release is concerned this object is dead.
> > > +    return 0;
> > 
> > I worry that something will see a return of 1 from NS_RELEASE and assume the
> > object isn't dead yet (which would be an invalid assumption in the general
> > case). If you think I shouldn't worry about this, I can remove.
> 
> I think you shouldn't worry about it; I'd rather make minimal changes to the
> timer code if possible, and breaking Release's contract doesn't seem like a
> good idea in any event.

Ok, will change.
Attachment #8838611 - Attachment is obsolete: true
Attachment #8839688 - Flags: review?(nfroyd)
Attachment #8838613 - Flags: review?(nfroyd)
(In reply to Byron Campen [:bwc] from comment #11)
> (In reply to Nathan Froyd [:froydnj] from comment #8)
> > Comment on attachment 8838613 [details] [diff] [review]
> > Part 2: Help prevent nullptr crashes due to misuse of the timer API
> > 
> > Review of attachment 8838613 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > Canceling review while we get this worked out.
> > 
> > ::: xpcom/threads/nsTimerImpl.cpp
> > @@ +452,5 @@
> > >      oldDelay = mDelay;
> > >      oldTimeout = mTimeout;
> > > +    // Ensure that the nsITimer does not unhook from the nsTimerImpl during
> > > +    // Fire; this will cause null pointer crashes if the user of the timer drops
> > > +    // its reference, and then uses the nsITimer* passed in the callback.
> > 
> > The scenario here is that:
> > 
> > T1: Create timer.
> > T1: timer->SetTarget(T2)
> > ....time passes
> > T2: nsTimeImpl::Fire is running.
> > T1: null out nsCOMPtr to timer (nsITimer).
> > T2: Pass nsITimer* from strong reference, which is bogus
> > 
> > Is that right?  TimerImpl is still holding a strong reference to nsITimer*
> > thing, though...or does that get removed somehow?
> 
> The scenario is:
> 
> T2: nsTimerImpl is running, during which the following happens:
>    1. T2 drops its (only) reference to the nsITimer, causing the refcount to
> go to 1.
>    2. nsTimer::Release sees the refcount of 1, then turns around and nulls
> out nsTimer::mTimerImpl.
>    3. Then, still in the callback, T2 uses the nsITimer* param in the
> callback to grab a _new_ (non-functional!) reference to the nsITimer.
> 
> Now we have T2 holding onto an nsITimer with no implementation, leading to a
> null pointer crash when it is used. Adding another ref during fire prevents
> the refcount from going to 1, and this scenario from unfolding. We could do
> the same thing if we could pass an already_addRefed<nsITimer> or something
> to the callback.

Ugh, who is doing this?  I wish we didn't have to addref/release an extra check just because people are doing weird things.
Attachment #8839688 - Flags: review?(nfroyd) → review+
Attachment #8838613 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd [:froydnj] from comment #14)
> (In reply to Byron Campen [:bwc] from comment #11)
> > (In reply to Nathan Froyd [:froydnj] from comment #8)
> > > Comment on attachment 8838613 [details] [diff] [review]
> > > Part 2: Help prevent nullptr crashes due to misuse of the timer API
> > > 
> > > Review of attachment 8838613 [details] [diff] [review]:
> > > -----------------------------------------------------------------
> > > 
> > > Canceling review while we get this worked out.
> > > 
> > > ::: xpcom/threads/nsTimerImpl.cpp
> > > @@ +452,5 @@
> > > >      oldDelay = mDelay;
> > > >      oldTimeout = mTimeout;
> > > > +    // Ensure that the nsITimer does not unhook from the nsTimerImpl during
> > > > +    // Fire; this will cause null pointer crashes if the user of the timer drops
> > > > +    // its reference, and then uses the nsITimer* passed in the callback.
> > > 
> > > The scenario here is that:
> > > 
> > > T1: Create timer.
> > > T1: timer->SetTarget(T2)
> > > ....time passes
> > > T2: nsTimeImpl::Fire is running.
> > > T1: null out nsCOMPtr to timer (nsITimer).
> > > T2: Pass nsITimer* from strong reference, which is bogus
> > > 
> > > Is that right?  TimerImpl is still holding a strong reference to nsITimer*
> > > thing, though...or does that get removed somehow?
> > 
> > The scenario is:
> > 
> > T2: nsTimerImpl is running, during which the following happens:
> >    1. T2 drops its (only) reference to the nsITimer, causing the refcount to
> > go to 1.
> >    2. nsTimer::Release sees the refcount of 1, then turns around and nulls
> > out nsTimer::mTimerImpl.
> >    3. Then, still in the callback, T2 uses the nsITimer* param in the
> > callback to grab a _new_ (non-functional!) reference to the nsITimer.
> > 
> > Now we have T2 holding onto an nsITimer with no implementation, leading to a
> > null pointer crash when it is used. Adding another ref during fire prevents
> > the refcount from going to 1, and this scenario from unfolding. We could do
> > the same thing if we could pass an already_addRefed<nsITimer> or something
> > to the callback.
> 
> Ugh, who is doing this?  I wish we didn't have to addref/release an extra
> check just because people are doing weird things.

Not sure. This is the only way I can think of that would explain some of the crashes I'm seeing. I could be totally wrong, of course. We'll just have to watch crash-stats and see...
Comment on attachment 8838613 [details] [diff] [review]
Part 2: Help prevent nullptr crashes due to misuse of the timer API

[Security approval request comment]
How easily could an exploit be constructed based on the patch?

   This would be really, really hard probably.

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?

   The problem is subtle. Someone could figure it out. Then again, it is just a nullptr crash.

Which older supported branches are affected by this flaw?

   52 I think.

If not all supported branches, which bug introduced the flaw?

   Bug 1157323

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?

   I suspect it won't be a totally clean application, but it shouldn't be too hard.

How likely is this patch to cause regressions; how much testing does it need?

   I can't see how this would make anything less safe.
Attachment #8838613 - Flags: sec-approval?
Comment on attachment 8839688 [details] [diff] [review]
Part 1: Simplify nsTimerEvent::Cancel, since there's no need to release the nsTimerImpl itself

[Security approval request comment]
How easily could an exploit be constructed based on the patch?

   Pretty hard, honestly. It can only be triggered by misuse in c++ land, which would probably be really hard to trigger from JS.

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?

   Not really, it just looks like a simplification.

Which older supported branches are affected by this flaw?

   I think this was present in 46 on.

If not all supported branches, which bug introduced the flaw?

   Bug 901097.

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?

   Until 52, it should be easy. Past that, much much harder, since the necessary threadsafety guarantees just aren't there.

How likely is this patch to cause regressions; how much testing does it need?

   I don't think it is very likely. If something is already leaking Cancelled nsTimerEvents, now nsTimerImpl and nsITimer will be leaked too.
Attachment #8839688 - Flags: sec-approval?
We'll need branch patches hee but this may have missed the last beta. I'll need to loop in release management to see if we can take this.
Flags: needinfo?(rkothari)
Flags: needinfo?(lhenry)
Flags: needinfo?(jcristau)
I'm on the fence here, leaning towards not taking this for 52 since we are just about to go to build with the last beta and it may be hard to exploit. If julien disagrees we could still take it for next monday's RC build.
Flags: needinfo?(lhenry)
I'm kinda tempted to say land in nightly, and see what happens before we decide to uplift, since I'm not 100% sure these crashes are caused by the problems these patches fix.
I agree with comment 19.
Flags: needinfo?(rkothari)
Flags: needinfo?(jcristau)
(In reply to Byron Campen [:bwc] from comment #20)
> I'm kinda tempted to say land in nightly, and see what happens before we
> decide to uplift, since I'm not 100% sure these crashes are caused by the
> problems these patches fix.

All right. I'll sec-approval+ this but let's only take it on nightly for now.
Attachment #8838613 - Flags: sec-approval? → sec-approval+
Attachment #8839688 - Flags: sec-approval? → sec-approval+
Ugh, lots of permaorange on that push. Looks like it had other causes, but I'll do another try based on latest m-c to be sure.
https://hg.mozilla.org/integration/mozilla-inbound/rev/e133ed915c7685885f67f589fc2f72f0ddfe35eb
Bug 1339588 - Part 1: Simplify nsTimerEvent::Cancel, since there's no need to release the nsTimerImpl itself. r=froydnj

https://hg.mozilla.org/integration/mozilla-inbound/rev/56fbe9964a0bccccb6b75a2be4190f52f7a0a502
Bug 1339588 - Part 2: Help prevent nullptr crashes due to misuse of the timer API. r=froydnj
https://hg.mozilla.org/mozilla-central/rev/e133ed915c76
https://hg.mozilla.org/mozilla-central/rev/56fbe9964a0b
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Group: dom-core-security → core-security-release
Track 53+/54+/55+ as sec-high.

Hi :bwc,
Since this bug also affects 53/54, do you think it's worth uplifting to 54 at least?
Flags: needinfo?(docfaraday)
We want to see whether these patches actually help the crash rate before uplifting anything.
Flags: needinfo?(docfaraday)
Byron, can you tell if the crash rate improved here? Do you need more help analyzing crash-stats results?
Flags: needinfo?(docfaraday)
It seems to have helped some, but there's still a race it looks like. I've made it harder to hit, but not eliminated it. I need to think of a way to get it for good.
Flags: needinfo?(docfaraday)
MozReview-Commit-ID: J6TNJqGsBv4
Attachment #8851650 - Flags: review?(nfroyd)
Part 3 was written to address the crashes you're seeing over in bug 1352047.
Flags: needinfo?(nfroyd)
Comment on attachment 8851650 [details] [diff] [review]
Part 3: Don't break the nsTimer/nsTimerImpl cycle during Fire

Review of attachment 8851650 [details] [diff] [review]:
-----------------------------------------------------------------

Apologies for taking so long to review this.

::: xpcom/threads/nsTimerImpl.cpp
@@ +326,5 @@
>    ++mGeneration;
>  
> +  if (mCallbackDuringFire.mType != Callback::Type::Unknown) {
> +    // Callback is firing; Cancel did not work!
> +    return NS_ERROR_FAILURE;

I am not comfortable with this; there are a decent number of places that check for timer cancellation failure, and we've never tested those failure codepaths.  A couple have things like MOZ_ALWAYS_SUCCEEDS attached to them as well.  Maybe if we returned something successful, but bogus (NS_SUCCESS_LOSS_OF_INSIGNIFICANT_DATA?) that we could explicitly check for in Release(), that would be better.
Attachment #8851650 - Flags: review?(nfroyd) → review-
MozReview-Commit-ID: J6TNJqGsBv4
Attachment #8851650 - Attachment is obsolete: true
Comment on attachment 8853053 [details] [diff] [review]
Part 3: Don't break the nsTimer/nsTimerImpl cycle during Fire

Review of attachment 8853053 [details] [diff] [review]:
-----------------------------------------------------------------

https://treeherder.mozilla.org/#/jobs?repo=try&revision=be4781c70e1c16c471055176af7c77e45926aaf5&selectedJob=87064301
Attachment #8853053 - Flags: review?(nfroyd)
Comment on attachment 8853053 [details] [diff] [review]
Part 3: Don't break the nsTimer/nsTimerImpl cycle during Fire

Review of attachment 8853053 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/threads/nsTimerImpl.cpp
@@ +311,5 @@
>    return InitCommon(aDelay, aType);
>  }
>  
>  nsresult
> +nsTimerImpl::CancelFailIfFiring()

This works too, thank you.

@@ +335,5 @@
> +
> +nsresult
> +nsTimerImpl::Cancel()
> +{
> +  (void)CancelFailIfFiring();

WDYT about adding a brief comment here, like "// Nobody expects timer canceling to actually fail, so suppress errors." ?
Attachment #8853053 - Flags: review?(nfroyd) → review+
Flags: needinfo?(nfroyd)
MozReview-Commit-ID: J6TNJqGsBv4
Attachment #8853053 - Attachment is obsolete: true
Comment on attachment 8854137 [details] [diff] [review]
Part 3: Don't break the nsTimer/nsTimerImpl cycle during Fire

Review of attachment 8854137 [details] [diff] [review]:
-----------------------------------------------------------------

Decided to rename and change signature here, since the name was misleading.
Attachment #8854137 - Flags: review?(nfroyd)
Comment on attachment 8854137 [details] [diff] [review]
Part 3: Don't break the nsTimer/nsTimerImpl cycle during Fire

Review of attachment 8854137 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/threads/nsTimerImpl.h
@@ +49,5 @@
>    static nsresult Startup();
>    static void Shutdown();
>  
>    void SetDelayInternal(uint32_t aDelay, TimeStamp aBase = TimeStamp::Now());
> +  bool CancelCheckIfFiring();

This reads oddly to me.  How about "AttemptCancel()", with a documentation string describing what the return value indicates?
Attachment #8854137 - Flags: review?(nfroyd) → review+
I'm not going to try to get part 3 in 53. I don't want to rush this that much.
(In reply to Nathan Froyd [:froydnj] from comment #41)
> Comment on attachment 8854137 [details] [diff] [review]
> Part 3: Don't break the nsTimer/nsTimerImpl cycle during Fire
> 
> Review of attachment 8854137 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: xpcom/threads/nsTimerImpl.h
> @@ +49,5 @@
> >    static nsresult Startup();
> >    static void Shutdown();
> >  
> >    void SetDelayInternal(uint32_t aDelay, TimeStamp aBase = TimeStamp::Now());
> > +  bool CancelCheckIfFiring();
> 
> This reads oddly to me.  How about "AttemptCancel()", with a documentation
> string describing what the return value indicates?

A return of true doesn't indicate that the Cancel failed, though. This is literally just Cancel, but it happens to return a bool indicating whether the timer was firing or not.
(In reply to Byron Campen [:bwc] from comment #42)
> I'm not going to try to get part 3 in 53. I don't want to rush this that
> much.

If you want parts 1 & 2 for 53, they need to be nominated now.  Otherwise, we need to wontfix the entire bug for 53.  Thanks
Flags: needinfo?(docfaraday)
Yeah, let's just hold off on 53.
Flags: needinfo?(docfaraday)
clearing tracking for 53 since we've WONTFIXed it
Comment on attachment 8854137 [details] [diff] [review]
Part 3: Don't break the nsTimer/nsTimerImpl cycle during Fire

[Security approval request comment]
How easily could an exploit be constructed based on the patch?

  It would be really hard, I think.

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?

  Not exactly.

Which older supported branches are affected by this flaw?

  This question is kinda muddy; the timer stuff has been messed up for a long time, and when trying to squeeze the bugs out they manifest in different ways. It would be worth landing this on 54.

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?

  These should be easy.

How likely is this patch to cause regressions; how much testing does it need?

  It could cause leaks I suppose, but I think it should be fine.
Attachment #8854137 - Flags: sec-approval?
Comment on attachment 8854137 [details] [diff] [review]
Part 3: Don't break the nsTimer/nsTimerImpl cycle during Fire

sec-approval=dveditz

Since we're holding off on 53 (a good idea) we need to wait on landing on ESR-52 until after merge day, but you can land on aurora 54 now. a=dveditz for aurora
Attachment #8854137 - Flags: approval-mozilla-aurora+
Attachment #8854137 - Flags: sec-approval? → sec-approval+
This needs part 3 to land still?
Flags: needinfo?(docfaraday)
Yes. I will be landing this in a few minutes.
https://hg.mozilla.org/integration/mozilla-inbound/rev/721db1be4f27c3d1eec87b40a342f88cfc433716
Bug 1339588 - Part 3: Don't break the nsTimer/nsTimerImpl cycle during Fire. r=froydnj, a=dveditz
Flags: needinfo?(docfaraday)
Byron -- Is this ready to be uplifted to ESR52?  (I'm getting regular, automatic emails from the relman tracking bot since it's being tracked for ESR52.)
Flags: needinfo?(docfaraday)
Sadly, it looks like we're still seeing crashes in crash-stats for this stuff. Back to the drawing board I guess...
Flags: needinfo?(docfaraday)
Whiteboard: [post-critsmash-triage]
Resetting "fixed" flags and re-opening since comment 56 since this isn't fixed.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Byron, did you have a chance to investigate this bug again?
Flags: needinfo?(docfaraday)
Byron, did you have a chance to investigate this bug again?
No. I am neck-deep in a major rework of the webrtc signaling code.
Flags: needinfo?(docfaraday)
I suggest Byron continue work on the signaling code, which is important and a large effort, and we see who else can take another look at this, with Byron as a resource to discuss things with.  This is very touchy, cross-thread and performance-important code, so a new set of eyes on this might help.

A current search shows a fair number of problems (3-400/week, ignoring the top hit which is all ff33 and before):
https://crash-stats.mozilla.com/search/?signature=~nsTimer&date=%3E%3D2017-06-08T16%3A15%3A28.000Z&date=%3C2017-06-15T16%3A15%3A28.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

Many of these may be misuses of data outside of timer itself, such as two threads using obj->mTimer without a lock, and one of them nulling it out.  Due to cache synchronization, timing, etc that could cause either a null-deref or a UAF (since the nulling might not be visible in the other, or might already be cached in a register).

Finding a way to catch (in debug or otherwise) bad uses of timers would help.

Also: billm/etc are changing the MainThread model to a pseudo-coroutine setup (N threads that are "MainThread" and never run at the same time, but all process the virtual event queue for mainthread).  See Bug 1365102 for recent changes to nsTimerImpl.cpp, for example.  However, I'm concerned if this works perfectly with nsTimerImpl::SetTarget() -- the targets fed in need to be the virtual nsIEventTargets, not the real thread eventtargets, and we need to ensure this doesn't violate any assumptions in the timer code.  Bill, can you check this?
Flags: needinfo?(wmccloskey)
(In reply to Randell Jesup [:jesup] from comment #61)
> Also: billm/etc are changing the MainThread model to a pseudo-coroutine
> setup (N threads that are "MainThread" and never run at the same time, but
> all process the virtual event queue for mainthread).  See Bug 1365102 for
> recent changes to nsTimerImpl.cpp, for example.  However, I'm concerned if
> this works perfectly with nsTimerImpl::SetTarget() -- the targets fed in
> need to be the virtual nsIEventTargets, not the real thread eventtargets,
> and we need to ensure this doesn't violate any assumptions in the timer
> code.  Bill, can you check this?

I don't think anything can go wrong here regardless of which kind of target you use. The only thing we do with the target is to call Dispatch and IsOnCurrentThread. Those methods will work regardless of whether you're using a cooperative thread as the event target or the original main thread.

I've considered requiring, for consistency, that the cooperative threads never have methods like Dispatch called on them. That way people always have to use the original main thread. However, I think that's a separate matter from this crash.
Flags: needinfo?(wmccloskey)
(In reply to Bill McCloskey (:billm) from comment #62)
> I don't think anything can go wrong here regardless of which kind of target
> you use. The only thing we do with the target is to call Dispatch and
> IsOnCurrentThread. Those methods will work regardless of whether you're
> using a cooperative thread as the event target or the original main thread.

My concern was that the Timer will Dispatch to a specific cooperative thread, and not the meta-thread nsIEventTarget.  If that's ok, then callers to this may be fine.  And this was just a note based on this bug causing me to look at the code; I don't think any of the cooperative threading changes have anything to do with this bug.
At leat some of these crashes are UAFs even in recent versions.
Status: REOPENED → NEW
Target Milestone: mozilla55 → ---
Track 56+/57+ as sec-high.
Byron -- Do you have any new ideas for this?  If not, I think it makes sense for you to unassign yourself from this bug and for us to ask froydnj to find a new owner.  Thanks.
Flags: needinfo?(docfaraday)
Yeah, I simply don't have time to chase this right now.
Assignee: docfaraday → nobody
Flags: needinfo?(docfaraday) → needinfo?(nfroyd)
Wontfix for 56 at this point. We could still take a patch for 57.
Nathan, who could be assigned this bug?
(In reply to Stephanie Ouillon [:arroway] from comment #69)
> Nathan, who could be assigned this bug?

I'm not sure if there's as much left to do here; adjusting the query in comment 61 for the last month:

https://crash-stats.mozilla.com/search/?signature=~nsTimer&date=%3E%3D2017-09-04T14%3A07%3A35.000Z&date=%3C2017-10-04T14%3A07%3A35.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

and browsing through the top ~10 results indicates that we have very few (~10-20) crashes on 56/57/58.  So maybe this is fixed?  Byron landed a patch during the...57 cycle, I think? that might have addressed some of these.
Flags: needinfo?(nfroyd)
I don't see any obvious UAFs in 57, a very small handful in 56, and more in 55 and earlier. It might make more sense to mark this one fixed (since we did reduce the volume -- something got better) and start and new bug.
Status: NEW → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
What if anything do we want to do for ESR52 here?
Assignee: nobody → docfaraday
Target Milestone: mozilla54 → mozilla55
I don't think this work will apply to ESR.
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.