Closed Bug 1392705 Opened 2 years ago Closed 2 years ago

Intermittent autophone-s1s2 | application crashed [@ libc.so + 0x49da0]

Categories

(Firefox for Android :: General, defect, P5, critical)

defect

Tracking

()

RESOLVED FIXED
Firefox 57
Tracking Status
fennec + ---
firefox55 --- unaffected
firefox56 --- fixed
firefox57 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: rbarker)

References

(Blocks 1 open bug)

Details

(Keywords: crash, intermittent-failure, Whiteboard: [stockwell fixed:product])

Crash Data

Attachments

(3 files, 3 obsolete files)

tracking-fennec: --- → ?
tracking-fennec: ? → +
20 failures in <4 days:
https://brasstacks.mozilla.com/orangefactor/index.html?display=Bug&bugid=1392705

:snorp, we should get someone to look at this.
Flags: needinfo?(snorp)
Whiteboard: [stockwell needswork]
Duplicate of this bug: 1393997
Randall this looks similar to the other bug you are looking at.
Assignee: nobody → rbarker
Flags: needinfo?(snorp)
I've attached a speculative fix since I can't reproduce the issue locally. What would be a good way to test this on autophone?
Flags: needinfo?(bob)
First, do a try build with try: -b do -p android-api-15 -u autophone-s1s2 -t none

That will run tests on the production system for the devices nexus-4-13, nexus-5-6, nexus-6p-11, nexus-9-2, pixel-10. 

With the low orange factor of ~0.03 I would expect that you probably won't see this particular failure with one run even without the patch.

Need Info me when your build is complete and I will manually trigger the test in stages (so as to not over load the system) until we've run it 50 times. The 'staging' system has 3 nexus 5 and 2 pixels we can also use though it does not respond to try builds directly so I'll manually trigger the tests there. Its results will be reported to treeherder.allizom.org.
Flags: needinfo?(bob)
(In reply to Bob Clary [:bc:] from comment #7)
> First, do a try build with try: -b do -p android-api-15 -u autophone-s1s2 -t
> none
> 
> That will run tests on the production system for the devices nexus-4-13,
> nexus-5-6, nexus-6p-11, nexus-9-2, pixel-10. 
> 
> With the low orange factor of ~0.03 I would expect that you probably won't
> see this particular failure with one run even without the patch.
> 
> Need Info me when your build is complete and I will manually trigger the
> test in stages (so as to not over load the system) until we've run it 50
> times. The 'staging' system has 3 nexus 5 and 2 pixels we can also use
> though it does not respond to try builds directly so I'll manually trigger
> the tests there. Its results will be reported to treeherder.allizom.org.

I must be doing something wrong. When I try and start a try build it fails:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=bc60adb7421ce0c7f5f64153e44f587ddb8d5485&selectedJob=126818151
Do I need permission to run autophone? I can't get it to start.
Flags: needinfo?(bob)
Looks like try chooser needs its platforms updated. try this:

try: -b do -p android-api-16 -u autophone-s1s2 -t none
Flags: needinfo?(bob)
I need to add android-api-16 to the supported platforms in Bug 1394960. I'll submit the try runs when I deploy that.
Ok. I've submitted one iteration to autophone-{1,2,3} which you can see at your original try run and one iteration to autophone-4 which you can see at https://treeherder.allizom.org/#/jobs?repo=try&revision=a325e75ac50a6c1d000fea61d7cb8855aaf58c16&exclusion_profile=false&group_state=expanded

I'll check back later after this has completed and submit moar.
Looks like this particular crash is fixed with this patch.
Flags: needinfo?(bob)
New fix try run on autophone looks promising:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=342144f6989c773470b787cb6c77425648de4ac4

Probably needs more runs to be more confident.
Flags: needinfo?(bob)
Attachment #8902017 - Attachment is obsolete: true
I've kicked off some try runs on production and staging. See staging here:
https://treeherder.allizom.org/#/jobs?repo=try&revision=342144f6989c773470b787cb6c77425648de4ac4&group_state=expanded&exclusion_profile=false

Looks like Bug 1397035 / Bug 1396804 crash [@ mozilla::URLPreloader::WriteCache] is standing up as our new crash.
This my best guess as to what it going on in this crash. Due to the indeterminate nature of Gecko shutdown, the ref count of the nsWindow on Android would sometime go to zero before the XPCOM shutdown observer was called in nsBaseWindow which is where the compositor thread IPC is shutdown. If nsBaseWindow::Shutdown does not get called, then the compositor thread IPC is shutdown in the nsBaseWindow destructor. Unfortunately while the nsWindow is being deleted, it can be accessed in the compositor thread and cause a crash in LayerManagerComposite::RenderToPresentationSurface. By having the WidgetShutdownObserver maintain a reference to the nsBaseWindow, it ensures that the XPCOM shutdown is observed before the nsWindow is deleted.
A couple of tests remain, but I think this is good to go from my pov.
Flags: needinfo?(bob)
Comment on attachment 8905191 [details]
Bug 1392705 - part 3: Ensure mWidget is valid in AndroidCompositorWidget

https://reviewboard.mozilla.org/r/176974/#review182088

Is it possible for `mWidget` to become null? I don't see it being set anywhere other than the constructor.
Comment on attachment 8905192 [details]
Bug 1392705 - part 3: Call nsBaseWidget::DestroyLayerManager() in nsWindow::Destroy to ensure IPC is not shutdown in the destructor for Android

https://reviewboard.mozilla.org/r/176976/#review182092

Seems like this would keep alive every `nsBaseWidget` ever created until shutdown?
Comment on attachment 8905191 [details]
Bug 1392705 - part 3: Ensure mWidget is valid in AndroidCompositorWidget

https://reviewboard.mozilla.org/r/176974/#review182088

You are right. I'll drop this patch. I got confused with all the different things I tried. There is a Compositor::DetachWidget() but that sets the actual CompositorWidget to nullptr.
Comment on attachment 8905192 [details]
Bug 1392705 - part 3: Call nsBaseWidget::DestroyLayerManager() in nsWindow::Destroy to ensure IPC is not shutdown in the destructor for Android

https://reviewboard.mozilla.org/r/176976/#review182092

Yes, and that would probably be a really bad idea for a GeckoView world. I was thinking of current Fennec where we only create two nested nsWindows. I'm open to alternate solutions. I couldn't find any sane place to make sure nsBaseWidget::Shutdown is called before the nsWindow is deleted when the ref count goes to one.
Comment on attachment 8905192 [details]
Bug 1392705 - part 3: Call nsBaseWidget::DestroyLayerManager() in nsWindow::Destroy to ensure IPC is not shutdown in the destructor for Android

https://reviewboard.mozilla.org/r/176976/#review182092

I meant when the ref count goes to zero.
Comment on attachment 8905191 [details]
Bug 1392705 - part 3: Ensure mWidget is valid in AndroidCompositorWidget

This patch doesn't fix any issues.
Attachment #8905191 - Attachment is obsolete: true
Attachment #8905191 - Flags: review?(nchen)
The original fix isn't viable. This autophone try push uses an alternate solution and looks promising. 

Bob, can you work your autophone magic to run this solution through multiple tests?

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f110fdc0045ffe405ccc297c7b7b8ecc2c40e566
Flags: needinfo?(bob)
rbarker: You're looking good from my pov. I guess we're just waiting on reviews?
I think you just have to call `DestroyLayerManager` from `nsWindow::Destroy`. That way you don't even need to change `nsBaseWidget`.
(In reply to Jim Chen [:jchen] [:darchons] from comment #38)
> I think you just have to call `DestroyLayerManager` from
> `nsWindow::Destroy`. That way you don't even need to change `nsBaseWidget`.

I tried nsBaseWidget::DestroyLayerManager in nsWindow::Destroy() but it did not work:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a3da630a2282a84718f409ff99d965c6e2aef707

but calling nsBaseWidget::DestroyCompositor() in nsWindow::Destroy() looks promising:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=d1244dc9fd8fbefd9d37a18ef045cbac705281ff

But will need to run it through it's paces to be sure.

Bob, sorry to ask again. Is it documented some where how to correctly run this through autophone with out being disruptive and included the disabled tests? I've just been copy/pasting the direction posted in comment 10 to start the build. I should probably know the right way to do this my self so I don't have to ask for help after every revision.
Flags: needinfo?(bob)
The accepted way is just to click retrigger from taskcluster but that is too much of a pita for triggering any insignificant number of tries. The method I use is a command line which is easy to run but requires vpn access to the hosts. It's no big deal to run. If I'm ever not available, jmaher or gbrown can do it as well.

Just to be clear: retrigger the second d1244dc9fd8fbefd9d37a18ef045cbac705281ff build?
Flags: needinfo?(bob) → needinfo?(rbarker)
(In reply to Bob Clary [:bc:] from comment #40)
> The accepted way is just to click retrigger from taskcluster but that is too
> much of a pita for triggering any insignificant number of tries. The method
> I use is a command line which is easy to run but requires vpn access to the
> hosts. It's no big deal to run. If I'm ever not available, jmaher or gbrown
> can do it as well.
> 
> Just to be clear: retrigger the second
> d1244dc9fd8fbefd9d37a18ef045cbac705281ff build?

Sorry, yes the second try. Hopefully this is the last one. Thanks.
Flags: needinfo?(rbarker)
(In reply to Randall Barker [:rbarker] from comment #39)
> (In reply to Jim Chen [:jchen] [:darchons] from comment #38)
> > I think you just have to call `DestroyLayerManager` from
> > `nsWindow::Destroy`. That way you don't even need to change `nsBaseWidget`.
> 
> I tried nsBaseWidget::DestroyLayerManager in nsWindow::Destroy() but it did
> not work:
> 
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=a3da630a2282a84718f409ff99d965c6e2aef707
> 
> but calling nsBaseWidget::DestroyCompositor() in nsWindow::Destroy() looks
> promising:
> 
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=d1244dc9fd8fbefd9d37a18ef045cbac705281ff
> 
> But will need to run it through it's paces to be sure.

Nice! Could it be the crash from that first link is due to part 1 of the patches? If you look at the crash stack, it looks like we end up processing compositor IPC messages mid-shutdown due the the sync dispatch.
Comment on attachment 8905190 [details]
Bug 1392705 - part 2: Ensure LayerManagerComposite::RenderToPresentationSurface checks the compositor widget before using it

https://reviewboard.mozilla.org/r/176972/#review183278

This seems harmless enough.
Attachment #8905190 - Flags: review?(bugmail) → review+
Attachment #8905193 - Attachment is obsolete: true
Attachment #8905193 - Flags: review?(nchen)
(In reply to Jim Chen [:jchen] [:darchons] from comment #43)
> Nice! Could it be the crash from that first link is due to part 1 of the
> patches? If you look at the crash stack, it looks like we end up processing
> compositor IPC messages mid-shutdown due the the sync dispatch.

I think under normal operation, the LayerManager is deleted after the compositor thread has been halted. I think it looks something like this:

1) XPCOM shutdown triggers nsBaseWidget::Shutdown which calls nsBaseWidget::DestroyCompositor()
2) nsBaseWidget::~nsBaseWidget() calls nsBaseWidget::DestroyLayerManager() which works because nsBaseWidget::DestroyCompositor() has already been invoked.

So maybe nsBaseWidget::DestroyLayerManager() should be called first in nsBaseWidget::DestroyLayerManager() and then the layer manger freed? It isn't clear to me since shutdown is so indeterminate on Android.

The first part the patch seems to also fix the issues found in Bug 1394788 and Bug 1394428 so I'm pretty sure it is correct.
Comment on attachment 8905192 [details]
Bug 1392705 - part 3: Call nsBaseWidget::DestroyLayerManager() in nsWindow::Destroy to ensure IPC is not shutdown in the destructor for Android

https://reviewboard.mozilla.org/r/176976/#review183798
Attachment #8905192 - Flags: review?(nchen) → review+
Hello Randall: If this will address the crashes in Bug 1394788, can we please land it? Thanks.
Flags: needinfo?(rbarker)
See Also: → 1394788
Comment on attachment 8905189 [details]
Bug 1392705 - part 1: Make UiCompositorControllerChild::Destroy synchronous

Hopefully David can take this while Bill's out.

We're concerned here because this bug seems to be the root of bug 1394788's chain.
Attachment #8905189 - Flags: review?(wmccloskey) → review?(dvander)
(In reply to Marcia Knous [:marcia - use ni] from comment #51)
> Hello Randall: If this will address the crashes in Bug 1394788, can we
> please land it? Thanks.

As soon as part 1 of the patch gets reviewed I'll land it.
Flags: needinfo?(rbarker)
Duplicate of this bug: 1398705
David, this is an extremely frequent crash in the wild and in automation. Can you please prioritize this review or redirect it to someone else with more cycles?
Flags: needinfo?(dvander)
Comment on attachment 8905189 [details]
Bug 1392705 - part 1: Make UiCompositorControllerChild::Destroy synchronous

https://reviewboard.mozilla.org/r/176970/#review186046
Attachment #8905189 - Flags: review?(dvander) → review+
why is :dvander's review not allowing autoland to land the commits
ok, RyanVM landed this already
See Also: → 1394428
Duplicate of this bug: 1394428
Duplicate of this bug: 1395159
Duplicate of this bug: 1395156
Duplicate of this bug: 1395040
Duplicate of this bug: 1394665
Duplicate of this bug: 1394613
Duplicate of this bug: 1394292
Duplicate of this bug: 1394288
Duplicate of this bug: 1394279
Duplicate of this bug: 1394260
Duplicate of this bug: 1394258
Duplicate of this bug: 1394157
Duplicate of this bug: 1394155
Duplicate of this bug: 1395797
Duplicate of this bug: 1395796
Duplicate of this bug: 1394154
Duplicate of this bug: 1394129
Duplicate of this bug: 1393995
Duplicate of this bug: 1391435
Duplicate of this bug: 1388970
Duplicate of this bug: 1365409
Duplicate of this bug: 1359247
Whiteboard: [stockwell needswork] → [stockwell fixed:product]
rbarker, snorp: What are the chances of getting this on 56?
Flags: needinfo?(snorp)
Flags: needinfo?(rbarker)
I think it's pretty late. Unless we can prove that this is dramatically affecting users, I don't think it will make it.
Flags: needinfo?(snorp)
Bug 1394788 is being tracked for 56 due to the high crash rate. Does that help?
Flags: needinfo?(snorp)
Do we know for sure that this patch fixes that one too? If we do, then yes, I think that is a good justification.
Flags: needinfo?(snorp)
https://crash-stats.mozilla.com/signature/?signature=MessageLoop%3A%3APostTask_Helper isn't showing any crashes with today's build ID (20170919...), although looking at the timing of past reports, it's probably still a little too early to tell for sure.
Comment on attachment 8905189 [details]
Bug 1392705 - part 1: Make UiCompositorControllerChild::Destroy synchronous

Approval Request Comment
[Feature/Bug causing the regression]: Dynamic Toolbar v3
[User impact if declined]: crash on shutdown
[Is this code covered by automated tests?]: yes
[Has the fix been verified in Nightly?]: The test affect by this bug have been re-enabled and seem to be green now.
[Needs manual test from QE? If yes, steps to reproduce]: no
[List of other uplifts needed for the feature/fix]: Just the three patches attached to this bug.
[Is the change risky?]: probably not.
[Why is the change risky/not risky?]: This bug is only hit on shutdown. This patch only changes the order of shutdown operations to ensure some happen before others.
[String changes made/needed]: none
Flags: needinfo?(rbarker)
Attachment #8905189 - Flags: approval-mozilla-beta?
Comment on attachment 8905189 [details]
Bug 1392705 - part 1: Make UiCompositorControllerChild::Destroy synchronous

56 is on release now.
Attachment #8905189 - Flags: approval-mozilla-beta? → approval-mozilla-release?
Comment on attachment 8905189 [details]
Bug 1392705 - part 1: Make UiCompositorControllerChild::Destroy synchronous

Fix a shutdown crash, will help with some test failures, let's take this for RC2
Attachment #8905189 - Flags: approval-mozilla-release? → approval-mozilla-release+
Duplicate of this bug: 1392690
Duplicate of this bug: 1381547
Duplicate of this bug: 1359148
Duplicate of this bug: 1392941
Duplicate of this bug: 1392693
You need to log in before you can comment on or make changes to this bug.