Closed
Bug 1392705
Opened 8 years ago
Closed 8 years ago
Intermittent autophone-s1s2 | application crashed [@ libc.so + 0x49da0]
Categories
(Firefox for Android Graveyard :: General, defect, P5)
Firefox for Android Graveyard
General
Tracking
(fennec+, firefox55 unaffected, firefox56 fixed, firefox57 fixed)
RESOLVED
FIXED
Firefox 57
Tracking | Status | |
---|---|---|
fennec | + | --- |
firefox55 | --- | unaffected |
firefox56 | --- | fixed |
firefox57 | --- | fixed |
People
(Reporter: intermittent-bug-filer, Assigned: rbarker)
References
(Blocks 1 open bug)
Details
(Keywords: crash, intermittent-failure, Whiteboard: [stockwell fixed:product])
Crash Data
Attachments
(3 files, 3 obsolete files)
Updated•8 years ago
|
tracking-fennec: --- → ?
tracking-fennec: ? → +
Comment 1•8 years ago
|
||
20 failures in <4 days:
https://brasstacks.mozilla.com/orangefactor/index.html?display=Bug&bugid=1392705
:snorp, we should get someone to look at this.
Flags: needinfo?(snorp)
Whiteboard: [stockwell needswork]
Comment hidden (Intermittent Failures Robot) |
Randall this looks similar to the other bug you are looking at.
Assignee: nobody → rbarker
Flags: needinfo?(snorp)
Assignee | ||
Comment 5•8 years ago
|
||
Assignee | ||
Comment 6•8 years ago
|
||
I've attached a speculative fix since I can't reproduce the issue locally. What would be a good way to test this on autophone?
Flags: needinfo?(bob)
Comment 7•8 years ago
|
||
First, do a try build with try: -b do -p android-api-15 -u autophone-s1s2 -t none
That will run tests on the production system for the devices nexus-4-13, nexus-5-6, nexus-6p-11, nexus-9-2, pixel-10.
With the low orange factor of ~0.03 I would expect that you probably won't see this particular failure with one run even without the patch.
Need Info me when your build is complete and I will manually trigger the test in stages (so as to not over load the system) until we've run it 50 times. The 'staging' system has 3 nexus 5 and 2 pixels we can also use though it does not respond to try builds directly so I'll manually trigger the tests there. Its results will be reported to treeherder.allizom.org.
Flags: needinfo?(bob)
Assignee | ||
Comment 8•8 years ago
|
||
(In reply to Bob Clary [:bc:] from comment #7)
> First, do a try build with try: -b do -p android-api-15 -u autophone-s1s2 -t
> none
>
> That will run tests on the production system for the devices nexus-4-13,
> nexus-5-6, nexus-6p-11, nexus-9-2, pixel-10.
>
> With the low orange factor of ~0.03 I would expect that you probably won't
> see this particular failure with one run even without the patch.
>
> Need Info me when your build is complete and I will manually trigger the
> test in stages (so as to not over load the system) until we've run it 50
> times. The 'staging' system has 3 nexus 5 and 2 pixels we can also use
> though it does not respond to try builds directly so I'll manually trigger
> the tests there. Its results will be reported to treeherder.allizom.org.
I must be doing something wrong. When I try and start a try build it fails:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=bc60adb7421ce0c7f5f64153e44f587ddb8d5485&selectedJob=126818151
Assignee | ||
Comment 9•8 years ago
|
||
Do I need permission to run autophone? I can't get it to start.
Flags: needinfo?(bob)
Comment 10•8 years ago
|
||
Looks like try chooser needs its platforms updated. try this:
try: -b do -p android-api-16 -u autophone-s1s2 -t none
Flags: needinfo?(bob)
Assignee | ||
Comment 11•8 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a325e75ac50a6c1d000fea61d7cb8855aaf58c16
Looks like the build is finished.
Flags: needinfo?(bob)
Comment 12•8 years ago
|
||
I need to add android-api-16 to the supported platforms in Bug 1394960. I'll submit the try runs when I deploy that.
Comment 13•8 years ago
|
||
Ok. I've submitted one iteration to autophone-{1,2,3} which you can see at your original try run and one iteration to autophone-4 which you can see at https://treeherder.allizom.org/#/jobs?repo=try&revision=a325e75ac50a6c1d000fea61d7cb8855aaf58c16&exclusion_profile=false&group_state=expanded
I'll check back later after this has completed and submit moar.
Comment 14•8 years ago
|
||
Looks like this particular crash is fixed with this patch.
Flags: needinfo?(bob)
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 16•8 years ago
|
||
New fix try run on autophone looks promising:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=342144f6989c773470b787cb6c77425648de4ac4
Probably needs more runs to be more confident.
Flags: needinfo?(bob)
Assignee | ||
Updated•8 years ago
|
Attachment #8902017 -
Attachment is obsolete: true
Comment 17•8 years ago
|
||
I've kicked off some try runs on production and staging. See staging here:
https://treeherder.allizom.org/#/jobs?repo=try&revision=342144f6989c773470b787cb6c77425648de4ac4&group_state=expanded&exclusion_profile=false
Looks like Bug 1397035 / Bug 1396804 crash [@ mozilla::URLPreloader::WriteCache] is standing up as our new crash.
Assignee | ||
Comment 18•8 years ago
|
||
This my best guess as to what it going on in this crash. Due to the indeterminate nature of Gecko shutdown, the ref count of the nsWindow on Android would sometime go to zero before the XPCOM shutdown observer was called in nsBaseWindow which is where the compositor thread IPC is shutdown. If nsBaseWindow::Shutdown does not get called, then the compositor thread IPC is shutdown in the nsBaseWindow destructor. Unfortunately while the nsWindow is being deleted, it can be accessed in the compositor thread and cause a crash in LayerManagerComposite::RenderToPresentationSurface. By having the WidgetShutdownObserver maintain a reference to the nsBaseWindow, it ensures that the XPCOM shutdown is observed before the nsWindow is deleted.
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment 24•8 years ago
|
||
A couple of tests remain, but I think this is good to go from my pov.
Flags: needinfo?(bob)
Comment 25•8 years ago
|
||
mozreview-review |
Comment on attachment 8905191 [details]
Bug 1392705 - part 3: Ensure mWidget is valid in AndroidCompositorWidget
https://reviewboard.mozilla.org/r/176974/#review182088
Is it possible for `mWidget` to become null? I don't see it being set anywhere other than the constructor.
Comment 26•8 years ago
|
||
mozreview-review |
Comment on attachment 8905192 [details]
Bug 1392705 - part 3: Call nsBaseWidget::DestroyLayerManager() in nsWindow::Destroy to ensure IPC is not shutdown in the destructor for Android
https://reviewboard.mozilla.org/r/176976/#review182092
Seems like this would keep alive every `nsBaseWidget` ever created until shutdown?
Assignee | ||
Comment 27•8 years ago
|
||
mozreview-review-reply |
Comment on attachment 8905191 [details]
Bug 1392705 - part 3: Ensure mWidget is valid in AndroidCompositorWidget
https://reviewboard.mozilla.org/r/176974/#review182088
You are right. I'll drop this patch. I got confused with all the different things I tried. There is a Compositor::DetachWidget() but that sets the actual CompositorWidget to nullptr.
Assignee | ||
Comment 28•8 years ago
|
||
mozreview-review-reply |
Comment on attachment 8905192 [details]
Bug 1392705 - part 3: Call nsBaseWidget::DestroyLayerManager() in nsWindow::Destroy to ensure IPC is not shutdown in the destructor for Android
https://reviewboard.mozilla.org/r/176976/#review182092
Yes, and that would probably be a really bad idea for a GeckoView world. I was thinking of current Fennec where we only create two nested nsWindows. I'm open to alternate solutions. I couldn't find any sane place to make sure nsBaseWidget::Shutdown is called before the nsWindow is deleted when the ref count goes to one.
Assignee | ||
Comment 29•8 years ago
|
||
mozreview-review-reply |
Comment on attachment 8905192 [details]
Bug 1392705 - part 3: Call nsBaseWidget::DestroyLayerManager() in nsWindow::Destroy to ensure IPC is not shutdown in the destructor for Android
https://reviewboard.mozilla.org/r/176976/#review182092
I meant when the ref count goes to zero.
Assignee | ||
Comment 30•8 years ago
|
||
Comment on attachment 8905191 [details]
Bug 1392705 - part 3: Ensure mWidget is valid in AndroidCompositorWidget
This patch doesn't fix any issues.
Attachment #8905191 -
Attachment is obsolete: true
Attachment #8905191 -
Flags: review?(nchen)
Assignee | ||
Comment 31•8 years ago
|
||
The original fix isn't viable. This autophone try push uses an alternate solution and looks promising.
Bob, can you work your autophone magic to run this solution through multiple tests?
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f110fdc0045ffe405ccc297c7b7b8ecc2c40e566
Flags: needinfo?(bob)
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment 36•8 years ago
|
||
Done. Don't forget to include excluded jobs to see everything:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f110fdc0045ffe405ccc297c7b7b8ecc2c40e566&exclusion_profile=false
https://treeherder.allizom.org/#/jobs?repo=try&revision=f110fdc0045ffe405ccc297c7b7b8ecc2c40e566&exclusion_profile=false
Flags: needinfo?(bob)
Comment 37•8 years ago
|
||
rbarker: You're looking good from my pov. I guess we're just waiting on reviews?
Comment 38•8 years ago
|
||
I think you just have to call `DestroyLayerManager` from `nsWindow::Destroy`. That way you don't even need to change `nsBaseWidget`.
Assignee | ||
Comment 39•8 years ago
|
||
(In reply to Jim Chen [:jchen] [:darchons] from comment #38)
> I think you just have to call `DestroyLayerManager` from
> `nsWindow::Destroy`. That way you don't even need to change `nsBaseWidget`.
I tried nsBaseWidget::DestroyLayerManager in nsWindow::Destroy() but it did not work:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a3da630a2282a84718f409ff99d965c6e2aef707
but calling nsBaseWidget::DestroyCompositor() in nsWindow::Destroy() looks promising:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d1244dc9fd8fbefd9d37a18ef045cbac705281ff
But will need to run it through it's paces to be sure.
Bob, sorry to ask again. Is it documented some where how to correctly run this through autophone with out being disruptive and included the disabled tests? I've just been copy/pasting the direction posted in comment 10 to start the build. I should probably know the right way to do this my self so I don't have to ask for help after every revision.
Flags: needinfo?(bob)
Comment 40•8 years ago
|
||
The accepted way is just to click retrigger from taskcluster but that is too much of a pita for triggering any insignificant number of tries. The method I use is a command line which is easy to run but requires vpn access to the hosts. It's no big deal to run. If I'm ever not available, jmaher or gbrown can do it as well.
Just to be clear: retrigger the second d1244dc9fd8fbefd9d37a18ef045cbac705281ff build?
Flags: needinfo?(bob) → needinfo?(rbarker)
Assignee | ||
Comment 41•8 years ago
|
||
(In reply to Bob Clary [:bc:] from comment #40)
> The accepted way is just to click retrigger from taskcluster but that is too
> much of a pita for triggering any insignificant number of tries. The method
> I use is a command line which is easy to run but requires vpn access to the
> hosts. It's no big deal to run. If I'm ever not available, jmaher or gbrown
> can do it as well.
>
> Just to be clear: retrigger the second
> d1244dc9fd8fbefd9d37a18ef045cbac705281ff build?
Sorry, yes the second try. Hopefully this is the last one. Thanks.
Flags: needinfo?(rbarker)
Comment 42•8 years ago
|
||
Comment 43•8 years ago
|
||
(In reply to Randall Barker [:rbarker] from comment #39)
> (In reply to Jim Chen [:jchen] [:darchons] from comment #38)
> > I think you just have to call `DestroyLayerManager` from
> > `nsWindow::Destroy`. That way you don't even need to change `nsBaseWidget`.
>
> I tried nsBaseWidget::DestroyLayerManager in nsWindow::Destroy() but it did
> not work:
>
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=a3da630a2282a84718f409ff99d965c6e2aef707
>
> but calling nsBaseWidget::DestroyCompositor() in nsWindow::Destroy() looks
> promising:
>
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=d1244dc9fd8fbefd9d37a18ef045cbac705281ff
>
> But will need to run it through it's paces to be sure.
Nice! Could it be the crash from that first link is due to part 1 of the patches? If you look at the crash stack, it looks like we end up processing compositor IPC messages mid-shutdown due the the sync dispatch.
Comment hidden (Intermittent Failures Robot) |
Comment 45•8 years ago
|
||
mozreview-review |
Comment on attachment 8905190 [details]
Bug 1392705 - part 2: Ensure LayerManagerComposite::RenderToPresentationSurface checks the compositor widget before using it
https://reviewboard.mozilla.org/r/176972/#review183278
This seems harmless enough.
Attachment #8905190 -
Flags: review?(bugmail) → review+
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Updated•8 years ago
|
Attachment #8905193 -
Attachment is obsolete: true
Attachment #8905193 -
Flags: review?(nchen)
Assignee | ||
Comment 49•8 years ago
|
||
(In reply to Jim Chen [:jchen] [:darchons] from comment #43)
> Nice! Could it be the crash from that first link is due to part 1 of the
> patches? If you look at the crash stack, it looks like we end up processing
> compositor IPC messages mid-shutdown due the the sync dispatch.
I think under normal operation, the LayerManager is deleted after the compositor thread has been halted. I think it looks something like this:
1) XPCOM shutdown triggers nsBaseWidget::Shutdown which calls nsBaseWidget::DestroyCompositor()
2) nsBaseWidget::~nsBaseWidget() calls nsBaseWidget::DestroyLayerManager() which works because nsBaseWidget::DestroyCompositor() has already been invoked.
So maybe nsBaseWidget::DestroyLayerManager() should be called first in nsBaseWidget::DestroyLayerManager() and then the layer manger freed? It isn't clear to me since shutdown is so indeterminate on Android.
The first part the patch seems to also fix the issues found in Bug 1394788 and Bug 1394428 so I'm pretty sure it is correct.
Comment 50•8 years ago
|
||
mozreview-review |
Comment on attachment 8905192 [details]
Bug 1392705 - part 3: Call nsBaseWidget::DestroyLayerManager() in nsWindow::Destroy to ensure IPC is not shutdown in the destructor for Android
https://reviewboard.mozilla.org/r/176976/#review183798
Attachment #8905192 -
Flags: review?(nchen) → review+
Comment 51•8 years ago
|
||
Hello Randall: If this will address the crashes in Bug 1394788, can we please land it? Thanks.
Flags: needinfo?(rbarker)
Comment 52•8 years ago
|
||
Comment on attachment 8905189 [details]
Bug 1392705 - part 1: Make UiCompositorControllerChild::Destroy synchronous
Hopefully David can take this while Bill's out.
We're concerned here because this bug seems to be the root of bug 1394788's chain.
Attachment #8905189 -
Flags: review?(wmccloskey) → review?(dvander)
Assignee | ||
Comment 53•8 years ago
|
||
(In reply to Marcia Knous [:marcia - use ni] from comment #51)
> Hello Randall: If this will address the crashes in Bug 1394788, can we
> please land it? Thanks.
As soon as part 1 of the patch gets reviewed I'll land it.
Flags: needinfo?(rbarker)
Comment 55•8 years ago
|
||
David, this is an extremely frequent crash in the wild and in automation. Can you please prioritize this review or redirect it to someone else with more cycles?
Flags: needinfo?(dvander)
Comment hidden (Intermittent Failures Robot) |
![]() |
||
Comment 57•8 years ago
|
||
mozreview-review |
Comment on attachment 8905189 [details]
Bug 1392705 - part 1: Make UiCompositorControllerChild::Destroy synchronous
https://reviewboard.mozilla.org/r/176970/#review186046
Attachment #8905189 -
Flags: review?(dvander) → review+
Flags: needinfo?(dvander)
Comment 58•8 years ago
|
||
why is :dvander's review not allowing autoland to land the commits
Comment 59•8 years ago
|
||
ok, RyanVM landed this already
Comment 60•8 years ago
|
||
bugherder landing |
Comment 61•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/706c7baf6e58
https://hg.mozilla.org/mozilla-central/rev/07751873d49f
https://hg.mozilla.org/mozilla-central/rev/68c789e404f0
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox57:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → Firefox 57
![]() |
||
Updated•8 years ago
|
Whiteboard: [stockwell needswork] → [stockwell fixed:product]
Comment 84•8 years ago
|
||
rbarker, snorp: What are the chances of getting this on 56?
Flags: needinfo?(snorp)
Flags: needinfo?(rbarker)
I think it's pretty late. Unless we can prove that this is dramatically affecting users, I don't think it will make it.
Flags: needinfo?(snorp)
Comment 86•8 years ago
|
||
Bug 1394788 is being tracked for 56 due to the high crash rate. Does that help?
Flags: needinfo?(snorp)
Do we know for sure that this patch fixes that one too? If we do, then yes, I think that is a good justification.
Flags: needinfo?(snorp)
Comment 88•8 years ago
|
||
https://crash-stats.mozilla.com/signature/?signature=MessageLoop%3A%3APostTask_Helper isn't showing any crashes with today's build ID (20170919...), although looking at the timing of past reports, it's probably still a little too early to tell for sure.
Updated•8 years ago
|
status-firefox55:
--- → unaffected
status-firefox56:
--- → affected
Assignee | ||
Comment 89•8 years ago
|
||
Comment on attachment 8905189 [details]
Bug 1392705 - part 1: Make UiCompositorControllerChild::Destroy synchronous
Approval Request Comment
[Feature/Bug causing the regression]: Dynamic Toolbar v3
[User impact if declined]: crash on shutdown
[Is this code covered by automated tests?]: yes
[Has the fix been verified in Nightly?]: The test affect by this bug have been re-enabled and seem to be green now.
[Needs manual test from QE? If yes, steps to reproduce]: no
[List of other uplifts needed for the feature/fix]: Just the three patches attached to this bug.
[Is the change risky?]: probably not.
[Why is the change risky/not risky?]: This bug is only hit on shutdown. This patch only changes the order of shutdown operations to ensure some happen before others.
[String changes made/needed]: none
Flags: needinfo?(rbarker)
Attachment #8905189 -
Flags: approval-mozilla-beta?
Comment 90•8 years ago
|
||
Comment on attachment 8905189 [details]
Bug 1392705 - part 1: Make UiCompositorControllerChild::Destroy synchronous
56 is on release now.
Attachment #8905189 -
Flags: approval-mozilla-beta? → approval-mozilla-release?
Comment 91•8 years ago
|
||
Comment on attachment 8905189 [details]
Bug 1392705 - part 1: Make UiCompositorControllerChild::Destroy synchronous
Fix a shutdown crash, will help with some test failures, let's take this for RC2
Attachment #8905189 -
Flags: approval-mozilla-release? → approval-mozilla-release+
Comment 92•8 years ago
|
||
bugherder uplift |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•5 years ago
|
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•