Closed Bug 1489953 (opened last year; closed 7 months ago)

Crash in nsTimerImpl::Release on Android

Categories

(Core :: XPCOM, defect, P3, critical)

Hardware: ARM
OS: Android

Tracking


RESOLVED DUPLICATE of bug 1513615
Tracking Status
geckoview64 --- wontfix
geckoview65 --- wontfix
firefox-esr60 --- unaffected
firefox62 --- wontfix
firefox63 --- wontfix
firefox64 + disabled
firefox65 --- disabled
firefox66 --- ?

People

(Reporter: gsvelto, Assigned: glandium)

References

Details

(Keywords: crash, topcrash, Whiteboard: [geckoview:p2])

Crash Data

This bug was filed from the Socorro interface and is report bp-a3f48224-069f-49b9-b121-896700180910.
=============================================================

Top 10 frames of crashing thread:

0 libxul.so nsTimerImpl::Release android-ndk/sources/cxx-stl/llvm-libc++/include/atomic:1023
1 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1161
2 libxul.so NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:519
3 libxul.so mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:334
4 libxul.so MessageLoop::Run ipc/chromium/src/base/message_loop.cc:325
5 libxul.so nsThread::ThreadFunc xpcom/threads/nsThread.cpp:464
6 libnss3.so _pt_root nsprpub/pr/src/pthreads/ptthread.c:201
7 libc.so libc.so@0x47f47 
8 libc.so libc.so@0x47f2f 
9 libc.so libc.so@0x1afcd 

=============================================================
There seems to be a recent spike of this in Nightly, starting with build 20180905100117.
These crashes are all on Android, and bug 1480006 is in the regression range. Maybe this is a signature change?
This is currently the #2 overall crash on Fennec Nightly. One comment reads: "Just went to HNN's just-redone site (http://www.hawaiinewsnow.com/). Crashes are happening consistently in every nightly build."
Keywords: topcrash
Duplicate of this bug: 1496556
This is hurting me a lot. I get the crash window too often these days. It's interrupting me when I'm using other apps. I guess I should consider migrating to Firefox stable until this gets fixed. Hope this gets fixed soon.
Tentatively marking statuses, since crash-stats doesn't seem to have much before 64.

[Tracking Requested - why for this release]: Android topcrash, seemingly very reproducible, and people are filing duplicate bugs (e.g. bug 1496556).
Mike, can you take a look at these?
Flags: needinfo?(mh+mozilla)
Because of optimization, the stack traces are essentially useless: we have no clue what code path leads from nsThread::ProcessNextEvent to nsTimerImpl::Release, and the latter appears to be called with a bogus `this`.

So I backed out bug 1480006, let's see how it goes.
Flags: needinfo?(mh+mozilla)
Looks like Mike's 2018-10-15 backout in comment 9 worked. The last Android crashes with this signature are from build ID 20181015100128:

https://crash-stats.mozilla.com/search/?signature=~nsTimerImpl%3A%3ARelease&product=FennecAndroid&date=%3E%3D2018-04-24T16%3A50%3A01.000Z&date=%3C2018-10-24T16%3A50%3A01.000Z&_sort=-date&_facets=signature&_facets=cpu_arch&_facets=version&_facets=platform_pretty_version&_facets=build_id&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-build_id

This crash blocks enabling LTO on Android (bug 1480006). What can we do to diagnose or work around this crash? Is this more likely an existing timer lifetime bug or a miscompilation of nsTimerImpl::Release with LTO?

About 20% of recent crash reports' URLs point to Wikipedia, e.g. https://en.m.wikipedia.org/wiki/CRISPR.
Hardware: Unspecified → ARM
Summary: Crash in nsTimerImpl::Release → Crash in nsTimerImpl::Release on Android
Whiteboard: [geckoview:p2]
Now that we're past the merge point and clang has been upgraded to version 7, I think we can reasonably give LTO another try. If the crashes still happen, we'd need to find a way to get better stack traces (maybe by disabling tail-call optimization?).
It would be great if we could reproduce the problem on automation, though.
Heads up: Mike re-enabled Android LTO in 65 Nightly (bug 1480006 comment 15), so this nsTimerImpl::Release crash might return.
New reports are coming in with the same crash after the relanding of bug 1480006, but they have one more frame that previous reports didn't have.

https://crash-stats.mozilla.com/report/index/bf338227-d3c6-4551-932a-5008d0181025

0 	libxul.so 	nsTimerImpl::Release() 	android-ndk/sources/cxx-stl/llvm-libc++/include/atomic:1023
1 	libxul.so 	TimerThread::Run() 	mfbt/RefPtr.h:211
2 	libxul.so 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1245
3 	libxul.so 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/threads/nsThreadUtils.cpp:530

The instruction pointer in stack frame #1 is 0x371dd5 within libxul.so, which, per the stack walker output for the corresponding nightly, falls in a function inlined from RefPtr. From the crash reporter symbols, the surrounding code comes from:

TimerThread.cpp:187
TimerThread.cpp:0
RefPtr.h:265
RefPtr.h:211

(assuming there's no jumping around; I didn't look at the disassembly)

TimerThread.cpp:187 is nsTimerEvent::SetTimer.

Let's see how much further we can go with the data from those crashes, but in the meantime, let's back out bug 1480006 again.
Anthony says this bug is in Mike's queue so I will assign it to him for now.

P3 because, while the GeckoView team is eager to ship LTO, this crash is not currently affecting users.
Assignee: nobody → mh+mozilla
Priority: -- → P3
Who was able to reproduce this?

I would like a few builds to be tested:
- baseline with LTO, which should reproduce the crash: https://queue.taskcluster.net/v1/task/RaTeEhkfSA-oKT6tVPAqJg/runs/0/artifacts/public/build/target.apk
- tentative workaround #1: https://queue.taskcluster.net/v1/task/cMsmNaB_RvqFU5o3hmt1Ag/runs/0/artifacts/public/build/target.apk
- tentative workaround #2: https://queue.taskcluster.net/v1/task/EABRQHTgTkS61iiT_5NWvg/runs/0/artifacts/public/build/target.apk

I'd like to know if either or both tentative workarounds work, assuming the baseline does reproduce the crash.
Flags: needinfo?(cpeterson)
(In reply to Mike Hommey [:glandium] from comment #15)
> Who was able to reproduce this?

I don't know if anyone at Mozilla reproduced this crash. I can email the people who included their email address in their crash reports and ask them to test these builds.
We have at least one person CCed on this bug who had the problem. Kaartic, can you check comment 15?
Flags: needinfo?(kaartic)
(In reply to Mike Hommey [:glandium] from comment #17)
> We have at least someone CCed on this bug that had the problem. Kaartic, can
> you check comment 15?

Yes, I could help. So, I just have to use those APKs and verify whether the tentative fixes work, before which I have to ensure I'm able to reproduce the crash using the APK in the first link. Am I right?

Just for your information, I don't have specific steps to reproduce this. My reproduction recipe is just to install the app and use it as usual. If the build is affected, I see a crash at random times (even when I'm not using the app). My expected *maximum* time within which a random crash would occur in an affected build is 1 week (sooner is quite likely). So it would take me about 3 weeks to complete this test.
Flags: needinfo?(kaartic)
I just now tried installing the APK in the first link. But I couldn't. I got an error stating 'App could not be installed'. Not sure what's wrong.

BTW, do the above APKs update the Firefox Nightly I use or do they install as a separate app? I would like the former.
(In reply to Kaartic Sivaraam from comment #18)
> Yes, I could help. So, I just have to use those APKs and verify whether the
> tentative fixes work, before which I have to ensure I'm able to reproduce
> the crash using the APK in the first link. Am I right?

That's correct.

> Just for your information, I don't have specific steps to reproduce this.
> My reproduction recipe is just to install the app and use it as usual. If
> the build is affected, I see a crash at random times (even when I'm not
> using the app). My expected *maximum* time within which a random crash
> would occur in an affected build is 1 week (sooner is quite likely). So it
> would take me about 3 weeks to complete this test.

I don't know of any specific steps to reproduce either. Many of the crash reports had Wikipedia URLs, so maybe try browsing a lot of Wikipedia pages.

(In reply to Kaartic Sivaraam from comment #19)
> I just now tried installing the APK in the first link. But I couldn't. I
> got an error stating 'App could not be installed'. Not sure what's wrong.
> 
> BTW, do the above APKs update the Firefox Nightly I use or do they install
> as a separate app? I would like the former.

The APKs will install using the same "Firefox Nightly" app name, though the APK installer might complain that Firefox Nightly is already installed. In that case, you might need to uninstall Firefox Nightly first. I'm not sure if your bookmarks/etc will be preserved if you uninstall and reinstall (unless you use Firefox Sync to back them up).
Flags: needinfo?(cpeterson)
(In reply to Chris Peterson [:cpeterson] from comment #20)
> The APKs will install using the same "Firefox Nightly" app name, though the
> APK installer might complain that Firefox Nightly is already installed. In
> that case, you might need to uninstall Firefox Nightly first. I'm not sure
> if your bookmarks/etc will be preserved if you uninstall and reinstall
> (unless you use Firefox Sync to back them up).

I'm not able to install the APK. It seems I would have to uninstall first and then try installing it. I'm a little reluctant to uninstall, as I have a lot of tabs open that I have to revisit at some time in the future. I have Firefox Sync. Is there any way to restore open tabs after I reinstall? I'm also OK with installing the debug APKs as a separate app, if that is possible.
I have the three APKs downloaded to my phone.

I'm currently testing the baseline build, but having trouble triggering the bug.

In terms of my experience, there was a time when the bug would consistently cause Nightly to crash, but it has not triggered recently.

I will continue testing the baseline to see if I can reproduce it.
So it turns out LTO only appears to make a pre-existing problem worse. As a matter of fact, we're getting crash reports with the same (actually more complete) stack trace on release, and to a lesser extent, on other platforms. It seems to be happening more frequently on quad-core CPUs on Android 4.x and Android 5.x.

Example: https://crash-stats.mozilla.com/report/index/da8cdd36-5c1b-46d3-b0dc-27b6a0181120
with the full stack trace being:
0 	libxul.so 	nsTimerImpl::Release() 	android-ndk/sources/cxx-stl/llvm-libc++/include/atomic:1023
1 	libxul.so 	TimerThread::PostTimerEvent(already_AddRefed<nsTimerImpl>) 	xpcom/threads/TimerThread.cpp:793
2 	libxul.so 	TimerThread::Run() 	xpcom/threads/TimerThread.cpp:467

which corresponds to what I could make out from the LTO crashes.

There are also crashes with TimerThread::Run at the top of the stack:
https://crash-stats.mozilla.com/signature/?_sort=-date&signature=TimerThread%3A%3ARun&date=%3E%3D2018-11-14T11%3A12%3A25.000Z&date=%3C2018-11-21T11%3A12%3A25.000Z
most of which are on other platforms, but they may well have the same root cause.

This one, too, looks like the same crash, but on Windows:
https://crash-stats.mozilla.com/report/index/aeedb966-1ce0-4594-908d-243ae0181115

Overall, it seems there's a race condition somewhere in the TimerThread code, leading to what looks like a use-after-free.
One thing worth noting is that the code calling PostTimerEvent says this:
// We are going to let the call to PostTimerEvent here handle the
// release of the timer so that we don't end up releasing the timer
// on the TimerThread instead of on the thread it targets.

... and all the crashes I looked at are on the Timer thread...
It feels like what's happening is that some user of nsITimer is releasing its timer too early, racing with PostTimerEvent. Nathan, do you have any thoughts?
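
To make the suspected interleaving concrete, here is a deliberately racy sketch in plain C++ (not Gecko code; FakeTimer is a made-up stand-in for nsTimerImpl and its threadsafe refcounting):

  #include <atomic>
  #include <thread>

  struct FakeTimer {
    std::atomic<int> mRefCnt{1};
    void AddRef() { mRefCnt.fetch_add(1, std::memory_order_relaxed); }
    void Release() {
      // This fetch_sub is the atomic operation that shows up as
      // atomic:1023 in the crash stacks.
      if (mRefCnt.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        delete this;  // last ref: any remaining raw pointer now dangles
      }
    }
  };

  int main() {
    FakeTimer* timer = new FakeTimer();  // refcount 1, held by the "user"
    std::thread timerThread([timer] {
      timer->AddRef();   // use-after-free if the Release below already ran
      timer->Release();
    });
    timer->Release();    // the "too early" release, racing with the thread
    timerThread.join();
  }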
Flags: needinfo?(nfroyd)
I have looked at this bug this morning and I don't get what's going on.  None of the crashes are during shutdown, so that eliminates a lot of the possible weirdness.  The crash addresses are all over the place (e.g. some in the low 128K of memory or so).

And where is Release() even getting called from, anyway?  The bit in TimerThread::Run() says:

          // We are going to let the call to PostTimerEvent here handle the
          // release of the timer so that we don't end up releasing the timer
          // on the TimerThread instead of on the thread it targets.
          timerRef = PostTimerEvent(timerRef.forget());

so no release there.  So we dig into PostTimerEvent and we find:

already_AddRefed<nsTimerImpl>
TimerThread::PostTimerEvent(already_AddRefed<nsTimerImpl> aTimerRef)
{
  mMonitor.AssertCurrentThreadOwns();

  RefPtr<nsTimerImpl> timer(aTimerRef);
  ...
  RefPtr<nsTimerEvent> event = new nsTimerEvent;
  if (!event) {
    return timer.forget();
  }
  ...
  event->SetTimer(timer.forget()); // CRASH

so no release there, either--at least not on nsTimerImpl.  And finally, nsTimerEvent::SetTimer() is just:

  void SetTimer(already_AddRefed<nsTimerImpl> aTimer)
  {
    mTimer = aTimer;
    mGeneration = mTimer->GetGeneration();
  }

and again, no release there.  So maybe this is identical code folding (ICF) in action?  But even if it is, where in the world are we calling into an atomic Release() function that triggers the atomic<T>::fetch_sub on a bad pointer?  I guess we could be triggering nsTimerEvent::Release(), somehow, and that pointer is completely bogus because the underlying timer event allocator got its pages corrupted via buffer overruns, or something?

I guess we could forcibly un-inline SetTimer (or even AddRef/Release on some of the relevant classes in this file) to see if that adds usable stack frames to the crash reports.
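
For reference, a minimal sketch of that diagnostic, assuming the nsTimerEvent::SetTimer definition quoted above (MOZ_NEVER_INLINE is the mfbt annotation from mozilla/Attributes.h):

  // Forcing SetTimer out of line keeps it as a distinct frame even in
  // optimized/LTO builds, so crash stacks would show whether the bad
  // Release() really originates here.
  MOZ_NEVER_INLINE void SetTimer(already_AddRefed<nsTimerImpl> aTimer)
  {
    mTimer = aTimer;
    mGeneration = mTimer->GetGeneration();
  }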
Flags: needinfo?(nfroyd)
Mike, now that 66 Nightly has begun, can we try landing your un-inlining patches to see if they make the SetTimer crashes "go away" or at least give us better stack traces?
Flags: needinfo?(mh+mozilla)
(In reply to Nathan Froyd [:froydnj] from comment #26)
(...)
> and again, no release there.  So maybe this is ICF in action?  But even if
> it is, where in the world are we calling into an atomic Release() function
> that triggers the atomic<T>::fetch_sub on a bad pointer?  I guess we could
> be triggering nsTimerEvent::Release(), somehow, and that pointer is
> completely bogus because the underlying timer event allocator got its pages
> corrupted via buffer overruns, or something?

There's an implicit Release on the last line of PostTimerEvent, when `timer` is destructed (having not been forgotten) and its refcount is 1 (i.e. the refcount of the passed-in already_AddRefed was 1).
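
In other words, a minimal sketch of the RefPtr mechanics (not the actual PostTimerEvent source):

  void Example(already_AddRefed<nsTimerImpl> aTimerRef)
  {
    RefPtr<nsTimerImpl> timer(aTimerRef);  // adopts the caller's reference
    // ... paths that hand the reference onward call timer.forget() ...
  }  // ~RefPtr: if `timer` is still non-null here, it calls Release(), and
     // if that was the last reference, the nsTimerImpl dies right here, on
     // the timer thread.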
Flags: needinfo?(mh+mozilla) → needinfo?(nfroyd)
(In reply to Mike Hommey [:glandium] from comment #28)
> (In reply to Nathan Froyd [:froydnj] from comment #26)
> (...)
> > and again, no release there.  So maybe this is ICF in action?  But even if
> > it is, where in the world are we calling into an atomic Release() function
> > that triggers the atomic<T>::fetch_sub on a bad pointer?  I guess we could
> > be triggering nsTimerEvent::Release(), somehow, and that pointer is
> > completely bogus because the underlying timer event allocator got its pages
> > corrupted via buffer overruns, or something?
> 
> There's an implicit Release on the last line of PostTimerEvent, when `timer`
> is destructed (and was not forgotten) and its refcount is 1 (so, the
> refcount of the passed-in already_AddRefed was 1)

Sure, but:

a) A lot (all?) of the crash reports are happening deeper on the stack than PostTimerEvent, so I don't think that matters; and
b) the contract of PostTimerEvent is that we either own the passed-in nsTimerImpl, or we return it on failure.  Failure points:

https://searchfox.org/mozilla-central/rev/fd62b95c187a40b328d9e7fd9d848833a6942b57/xpcom/threads/TimerThread.cpp#703-707
https://searchfox.org/mozilla-central/rev/fd62b95c187a40b328d9e7fd9d848833a6942b57/xpcom/threads/TimerThread.cpp#717-720
https://searchfox.org/mozilla-central/rev/fd62b95c187a40b328d9e7fd9d848833a6942b57/xpcom/threads/TimerThread.cpp#745-749

The only successful return is at:

https://searchfox.org/mozilla-central/rev/fd62b95c187a40b328d9e7fd9d848833a6942b57/xpcom/threads/TimerThread.cpp#751

and if we reach there, the timer has been passed into the timer event we're dispatching:

https://searchfox.org/mozilla-central/rev/fd62b95c187a40b328d9e7fd9d848833a6942b57/xpcom/threads/TimerThread.cpp#735

and we've successfully dispatched the event that now owns the nsTimerImpl:

https://searchfox.org/mozilla-central/rev/fd62b95c187a40b328d9e7fd9d848833a6942b57/xpcom/threads/TimerThread.cpp#742

so I have a hard time seeing how we're going to Release().

Oh, oh, so we Dispatch() the timer event, but we don't actually pass the ownership into Dispatch().  So now the target thread has a ref, and we have a ref on the timer thread.  And if the target thread preempts our timer thread, and releases its reference to the timer event, we're going to release the last reference to the timer event on the timer thread, and with it the owning ref to the timer impl.  And then under certain circumstances, that could be the last ref to the timer impl...and we crash?

That at least provides a plausible path to how Release() gets called, but I'd have to think a little bit harder about why we would crash there.
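
A sketch of that sequence (simplified from the PostTimerEvent code above; `target` is a stand-in for the event's dispatch target, and error handling is omitted):

  RefPtr<nsTimerEvent> event = new nsTimerEvent;
  event->SetTimer(timer.forget());  // event now owns the nsTimerImpl ref
  target->Dispatch(event, NS_DISPATCH_NORMAL);  // target takes its *own*
                                                // ref; ours isn't moved
  // If the target thread runs the event and drops its reference before we
  // return, our `event` going out of scope here releases the *last*
  // reference, so the nsTimerEvent (and with it the owning nsTimerImpl
  // reference) is destroyed on the timer thread after all.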
Flags: needinfo?(nfroyd)
Horrible thought: the code in PostTimerEvent is:

  RefPtr<nsTimerEvent> event = ...;
  ...
  event->SetTimer(...);

and we apparently crash with a Release() call inside SetTimer() itself.  Expanding the above code a little bit, what we actually have is something like:

  nsTimerEvent* e = /* allocate memory */;
  e->mTimer.mRawPtr = nullptr;
  e->mGeneration = 0;
  e->AddRef();
  ...
  // The internal bits touching mTimer in SetTimer look like:
  if (e->mTimer.mRawPtr) {
    e->mTimer.mRawPtr->Release();
  }
  e->mTimer.mRawPtr = timer.take(); // Don't need to AddRef
  ...

What if we're somehow calling Release() in the above code, because the nullptr initialization didn't actually happen (memory bitflips?) or because the compiler elided it somehow, which seems somewhat plausible with LTO enabled?  I don't think that's a *great* theory, because this crash is intermittent, and in at least the "LTO removing code it shouldn't" scenario, we should be seeing the crash a *lot* more often.  But that's how I can see Release() getting called from inside calls to SetTimer, which is what some stacks are telling us...
Assuming comment 30 is semi-plausible, I've written patches in bug 1513615 to remove the (pointless) check and Release() call.  It's possible those patches will make the crash go away, but it's also possible they'll just move it to some other inscrutable place.
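
One plausible shape for such a change (a sketch, not necessarily the exact patch that landed): initialize mTimer directly in the nsTimerEvent constructor, so there is never a prior value to test and Release:

  explicit nsTimerEvent(already_AddRefed<nsTimerImpl> aTimer)
    // Direct member initialization: no old pointer exists, hence no
    // inlined test-and-Release for the compiler to miscompile or fold.
    // (Assumes mTimer is declared before mGeneration.)
    : mTimer(aTimer)
    , mGeneration(mTimer->GetGeneration())
  {
  }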
See Also: → 1513615
Reminder: the crash *is* happening without LTO (see comment 23).
The compiler can be smart enough to see that the memory's zeroed in the constructor and constant-propagate that fact to the inlined RefPtr::operator=.  That being said, I don't see that being done in my Linux x86-64 Nightly, so maybe this optimization is more complicated than I thought, or the passes aren't getting run at the right place in the pipeline.
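
Roughly, the folding in question would look like this (hand-waved; not actual compiler output), which would remove the branch entirely rather than crash in it:

  e->mTimer.mRawPtr = nullptr;      // inlined from the constructor
  // ... no intervening writes to mTimer ...
  if (e->mTimer.mRawPtr) {          // provably false after constant
    e->mTimer.mRawPtr->Release();   // propagation, so the branch and the
  }                                 // Release() call fold away entirely
  e->mTimer.mRawPtr = timer.take();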

(In reply to Nathan Froyd [:froydnj] from comment #31)
> Assuming comment 30 is semi-plausible, I've written patches in bug 1513615
> to remove the (pointless) check and Release() call. It's possible those
> patches will make the crash go away, but it's also possible they'll just
> move it to some other inscrutable place.

Nathan, now that you've landed your timer cleanup (bug 1513615), can we try re-enabling Android LTO to see if that fixed this nsTimerImpl::Release() crash? Should we let your timer cleanup bake a few days on Nightly first?
Flags: needinfo?(nfroyd)

(In reply to Chris Peterson [:cpeterson] from comment #34)
> Nathan, now that you've landed your timer cleanup (bug 1513615), can we try
> re-enabling Android LTO to see if that fixed this nsTimerImpl::Release()
> crash? Should we let your timer cleanup bake a few days on Nightly first?

Sure, we can re-enable Android LTO. I have a small preference to give it a day or two of bake time.
Flags: needinfo?(nfroyd)

It looks like this was largely fixed by bug 1513615. Looking at some of the crashes in Firefox 66 and later, they look different.

Status: NEW → RESOLVED
Closed: 7 months ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1513615