<a class="header-button" href="https://bugzilla.mozilla.org/home" title="Go to home page"> Bugzilla

Comment 6

•

7 years ago

This is hurting me a lot. I get the crash window too often these days. It's intereupting me when I'm using other apps. I guess I should consider migrating to Firefox stable until this gets fixed. Hope this gets fixed soon.

Julien Cristau [:jcristau]

Comment 7

•

7 years ago

Tentatively marking statuses, since crash-stats doesn't seem to have much before 64. [Tracking Requested - why for this release]: Android topcrash, seemingly very reproducible, and people are filing duplicate bugs (e.g. bug 1496556).

status-firefox62: --- → wontfix

status-firefox63: --- → wontfix

status-firefox64: --- → affected

status-firefox-esr60: --- → unaffected

tracking-firefox64: --- → ?

Comment 8

•

7 years ago

Mike can you take a look at these?

tracking-firefox64: ? → +

Flags: needinfo?(mh+mozilla)

Assignee

Comment 9

•

7 years ago

Because of optimization, the stack traces are essentially useless because we have no clue what code path leads from nsThread::ProcessNextEvent to nsTimerImpl::Release, and the latter is called with, it looks like, a buggy `this`. So I backed out bug 1480006, let's see how it goes.

Flags: needinfo?(mh+mozilla)

Updated

•

7 years ago

Blocks: android-lto

Comment 10

•

7 years ago

Looks like Mike's 2018-10-15 backout in comment 9 worked. The last Android crashes with this signature are from build ID 20181015100128: https://crash-stats.mozilla.com/search/?signature=~nsTimerImpl%3A%3ARelease&product=FennecAndroid&date=%3E%3D2018-04-24T16%3A50%3A01.000Z&date=%3C2018-10-24T16%3A50%3A01.000Z&_sort=-date&_facets=signature&_facets=cpu_arch&_facets=version&_facets=platform_pretty_version&_facets=build_id&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-build_id This crash blocks enabling LTO on Android (bug 1480006). What can we do to diagnose or work around this crash? Is this more likely an existing timer lifetime bug or a miscompilation of nsTimerImpl::Release with LTO? About 20% of recent crash reports' URLs point to Wikipedia, e.g. https://en.m.wikipedia.org/wiki/CRISPR.

Hardware: Unspecified → ARM

Summary: Crash in nsTimerImpl::Release → Crash in nsTimerImpl::Release on Android

Whiteboard: [geckoview:p2]

Assignee

Comment 11

•

7 years ago

Now that we're past the merge point and clang was upgraded to version 7, I think we can reasonably give LTO another try, and if the crashes still happen, we'd need to find a way to get better stack traces (disabling tail call optimizations maybe?). It would be great if we could reproduce the problem on automation, though.

Comment 12

•

7 years ago

Heads up: Mike re-enabled Android LTO in 65 Nightly (but 1480006 comment 15), so this nsTimerImpl::Release crash might return.

status-firefox64: affected → disabled

Assignee

Comment 13

•

7 years ago

New reports are coming in with the same crash after the relanding of bug 1480006, but they have one more frame that previous reports didn't have. https://crash-stats.mozilla.com/report/index/bf338227-d3c6-4551-932a-5008d0181025 0 libxul.so nsTimerImpl::Release() android-ndk/sources/cxx-stl/llvm-libc++/include/atomic:1023 1 libxul.so TimerThread::Run() mfbt/RefPtr.h:211 2 libxul.so nsThread::ProcessNextEvent(bool, bool*) xpcom/threads/nsThread.cpp:1245 3 libxul.so NS_ProcessNextEvent(nsIThread*, bool) xpcom/threads/nsThreadUtils.cpp:530 The instruction pointer in stack frame #1 is 0x371dd5 within libxul.so, which for the corresponding nightly is, as per the stack walker output, in an inlined function from RefPtr. From the crash reporter symbols, the surrounding code comes from: TimerThread.cpp:187 TimerThread.cpp:0 RefPtr.h:265 RefPtr.h:211 (assuming there's no jumping around ; I didn't look at the disassembly) TimerThread.cpp:187 is nsTimerEvent::SetTimer. Let's see how much further we can go with the data from those crashes, but in the meanwhile, let's backout bug 1480006 again.

Comment 14

•

7 years ago

Anthony says this bug is in Mike's queue so I will assign it to him for now. P3 because, while the GeckoView team is eager to ship LTO, this crash is not currently affecting users.

Assignee: nobody → mh+mozilla

status-firefox65: --- → disabled

Priority: -- → P3

Assignee

Comment 15

•

7 years ago

Who was able to reproduce this? I would like a few builds to be tested: - baseline with LTO, which should reproduce the crash: https://queue.taskcluster.net/v1/task/RaTeEhkfSA-oKT6tVPAqJg/runs/0/artifacts/public/build/target.apk - tentative workaround #1: https://queue.taskcluster.net/v1/task/cMsmNaB_RvqFU5o3hmt1Ag/runs/0/artifacts/public/build/target.apk - tentative workaround #2: https://queue.taskcluster.net/v1/task/EABRQHTgTkS61iiT_5NWvg/runs/0/artifacts/public/build/target.apk I'd like to know if either or both tentative workarounds work, assuming the baseline does reproduce the crash.

Flags: needinfo?(cpeterson)

Comment 16

•

7 years ago

(In reply to Mike Hommey [:glandium] from comment #15) > Who was able to reproduce this? I don't know if anyone at Mozilla reproduced this crash. I can email the people who included their email address in their crash reports and ask them to test these bugs.

Assignee

Comment 17

•

7 years ago

We have at least someone CCed on this bug that had the problem. Kaartic, can you check comment 15?

Flags: needinfo?(kaartic)

Comment 18

•

7 years ago

(In reply to Mike Hommey [:glandium] from comment #17) > We have at least someone CCed on this bug that had the problem. Kaartic, can > you check comment 15? Yes, I could help. So, I just have to use that apk and verify whether the tentative fixes works. Before which I have to ensure I'm able to reproduce the crash using the apk in the first link. Am I right? Just for your information, I don't have specific steps to reproduce this. My reproduction recipe is just to install the app and use it as usual. If the app is infected I see a crash at random times (even when I'm not using the app). My expected *maximum* time within which a random a random crash would occur in an affected app is 1 week (sooner is quite likely). So it would take me about 3 weeks to complete this test.

Flags: needinfo?(kaartic)

Comment 19

•

7 years ago

I just now tried installing the apk in the first link. But I couldn't. I got an error stating the 'App could not be installed'. Not sure what's wrong. BTW, do the above APKs update the Firefox Nightly I use or they install as a separate app? I would like the former.

Comment 20

•

7 years ago

(In reply to Kaartic Sivaraam from comment #18) > Yes, I could help. So, I just have to use that apk and verify whether the > tentative fixes works. Before which I have to ensure I'm able to reproduce > the crash using the apk in the first link. Am I right? That's correct. > Just for your information, I don't have specific steps to reproduce this. My > reproduction recipe is just to install the app and use it as usual. If the > app is infected I see a crash at random times (even when I'm not using the > app). My expected *maximum* time within which a random a random crash would > occur in an affected app is 1 week (sooner is quite likely). So it would > take me about 3 weeks to complete this test. I don't know of any specific steps to reproduce either. Many of the crash reports had Wikipedia URLs, so maybe try browsing a lot of Wikipedia pages. (In reply to Kaartic Sivaraam from comment #19) > I just now tried installing the apk in the first link. But I couldn't. I got > an error stating the 'App could not be installed'. Not sure what's wrong. > > BTW, do the above APKs update the Firefox Nightly I use or they install as a > separate app? I would like the former. The APKs will install using the same "Firefox Nightly" app name, though the APK installer might complain that Firefox Nightly is already installed. In that case, you might need to uninstall Firefox Nightly first. I'm not sure if your bookmarks/etc will be preserved if you uninstall and reinstall (unless you use Firefox Sync to back them up).

Flags: needinfo?(cpeterson)

•

7 years ago

Mike, now that 66 Nightly has begun, can we try landing your un-inlining patches to see if they make the SetTimer crashes "go away" or at least give us better stack traces?

status-firefox66: --- → ?

Flags: needinfo?(mh+mozilla)

Assignee

Comment 28

•

7 years ago

(In reply to Nathan Froyd [:froydnj] from comment #26) (...) > and again, no release there. So maybe this is ICF in action? But even if > it is, where in the world are we calling into an atomic Release() function > that triggers the atomic<T>::fetch_sub on a bad pointer? I guess we could > be triggering nsTimerEvent::Release(), somehow, and that pointer is > completely bogus because the underlying timer event allocator got its pages > corrupted via buffer overruns, or something? There's an implicit Release on the last line of PostTimerEvent, when `timer` is destructed (and was not forgotten) and its refcount is 1 (so, the refcount of the passed-in already_AddRefed was 1)

Flags: needinfo?(mh+mozilla) → needinfo?(nfroyd)

Comment 29

•

7 years ago

(In reply to Mike Hommey [:glandium] from comment #28) > (In reply to Nathan Froyd [:froydnj] from comment #26) > (...) > > and again, no release there. So maybe this is ICF in action? But even if > > it is, where in the world are we calling into an atomic Release() function > > that triggers the atomic<T>::fetch_sub on a bad pointer? I guess we could > > be triggering nsTimerEvent::Release(), somehow, and that pointer is > > completely bogus because the underlying timer event allocator got its pages > > corrupted via buffer overruns, or something? > > There's an implicit Release on the last line of PostTimerEvent, when `timer` > is destructed (and was not forgotten) and its refcount is 1 (so, the > refcount of the passed-in already_AddRefed was 1) Sure, but: a) A lot (all?) of the crash reports are happening deeper on the stack than PostTimerEvent, so I don't think that matters; and b) the contract of PostTimerEvent is that we either own the passed-in nsTimerImpl, or we return it on failure. Failure points: https://searchfox.org/mozilla-central/rev/fd62b95c187a40b328d9e7fd9d848833a6942b57/xpcom/threads/TimerThread.cpp#703-707 https://searchfox.org/mozilla-central/rev/fd62b95c187a40b328d9e7fd9d848833a6942b57/xpcom/threads/TimerThread.cpp#717-720 https://searchfox.org/mozilla-central/rev/fd62b95c187a40b328d9e7fd9d848833a6942b57/xpcom/threads/TimerThread.cpp#745-749 The only successful return is at: https://searchfox.org/mozilla-central/rev/fd62b95c187a40b328d9e7fd9d848833a6942b57/xpcom/threads/TimerThread.cpp#751 and if we reach there, the timer has been passed into the timer event we're dispatching: https://searchfox.org/mozilla-central/rev/fd62b95c187a40b328d9e7fd9d848833a6942b57/xpcom/threads/TimerThread.cpp#735 and we've successfully dispatched the event that now owns the nsTimerImpl: https://searchfox.org/mozilla-central/rev/fd62b95c187a40b328d9e7fd9d848833a6942b57/xpcom/threads/TimerThread.cpp#742 so I have a hard time seeing how we're going to Release(). Oh, oh, so we Dispatch() the timer event, but we don't actually pass the ownership into Dispatch(). So now the target thread has a ref, and we have a ref on the timer thread. And if the target thread preempts our timer thread, and releases its reference to the timer event, we're going to release the last reference to the timer event on the timer thread, and with it the owning ref to the timer impl. And then under certain circumstances, that could be the last ref to the timer impl...and we crash? That at least provides a plausible path to how Release() gets called, but I'd have to think a little bit harder about why we would crash there.

Flags: needinfo?(nfroyd)

Comment 30

•

7 years ago

Horrible thought: the code in PostTimerEvent is: RefPtr<nsTimerEvent> event = ...; ... event->SetTimer(...); and we apparently crash with a Release() call inside SetTimer() itself. Expanding the above code a little bit, what we actually have is something like: nsTimerEvent* e = allocate memory; e->mTimer.mRawPtr = nullptr; e->mGeneration = 0; e->AddRef(); ... // The internal bits touching mTimer in SetTimer look like: if (e->mTimer.mRawPtr) { e->mTimer.mRawPtr->Release(); } e->mTimer.mRawPtr = timer.take(); // Don't need to AddRef ... What if we're somehow calling Release() in the above code, because the nullptr initialization didn't actually happen (memory bitflips?) or because the compiler elided it somehow, which seems somewhat plausible with LTO enabled? I don't think that's a *great* theory, because this crash is intermittent, and in at least the "LTO removing code it shouldn't" scenario, we should be seeing the crash a *lot* more often. But that's how I can see Release() getting called from inside calls to SetTimer, which is what some stacks are telling us...

Comment 31

•

7 years ago

Assuming comment 30 is semi-plausible, I've written patches in bug 1513615 to remove the (pointless) check and Release() call. It's possible those patches will make the crash go away, but it's also possible they'll just move it to some other inscrutable place.

•

7 years ago

(In reply to Nathan Froyd [:froydnj] from comment #31)

Assuming comment 30 is semi-plausible, I've written patches in bug 1513615
to remove the (pointless) check and Release() call. It's possible those
patches will make the crash go away, but it's also possible they'll just
move it to some other inscrutable place.

Nathan, now that you've landed your timer cleanup (bug 1513615), can we try re-enabling Android LTO to see if that fixed this nsTimerImpl::Release() crash? Should we let your timer cleanup bake a few days on Nightly first?

status-geckoview64: --- → wontfix

status-geckoview65: --- → wontfix

Flags: needinfo?(nfroyd)

Comment 35

•

7 years ago

(In reply to Chris Peterson [:cpeterson] from comment #34)

(In reply to Nathan Froyd [:froydnj] from comment #31)

Assuming comment 30 is semi-plausible, I've written patches in bug 1513615
to remove the (pointless) check and Release() call. It's possible those
patches will make the crash go away, but it's also possible they'll just
move it to some other inscrutable place.

Nathan, now that you've landed your timer cleanup (bug 1513615), can we try re-enabling Android LTO to see if that fixed this nsTimerImpl::Release() crash? Should we let your timer cleanup bake a few days on Nightly first?

Sure, we can re-enable Android LTO. I have a small preference to give it a day or two of bake time.

Flags: needinfo?(nfroyd)