Crash in mozilla::layers::ClientLayerManager::DidComposite

RESOLVED FIXED in Firefox 64

Status

()

--
critical
RESOLVED FIXED
2 months ago
5 hours ago

People

(Reporter: marcia, Assigned: mstange)

Tracking

({crash, regression})

Trunk
mozilla65
Unspecified
macOS
crash, regression
Points:
---

Firefox Tracking Flags

(firefox-esr60 unaffected, firefox63 wontfix, firefox64 fixed, firefox65 fixed)

Details

(crash signature)

Attachments

(1 attachment)

This bug was filed from the Socorro interface and is
report bp-7b1f595b-406e-4433-874a-16c9c0181126.
=============================================================

Mac crash at the top of Mac specific 63.0.3 crashes: https://bit.ly/2zqbupe. This crash seems to have started in 63, but wasn't seen in the 63 betas. Comments mention:

*Crashed in the middle of banking app Peoples.com 
*Was attempting to purchase something on Ancestry web site and clicked button for PayPal to checkout. The link opened a page which instantly disappeared as a pop-up presented in the foreground. The field of the pop-up was blank and the indicator suggested that it was loading. Suddenly, the browser crashed. I need to make that purchase before the deadline. 

Some of the other URLs seem to be people trying to pay bills. 

Top 10 frames of crashing thread:

0 XUL mozilla::layers::ClientLayerManager::DidComposite gfx/layers/client/ClientLayerManager.cpp:507
1 XUL mozilla::layers::CompositorBridgeChild::RecvDidComposite gfx/layers/ipc/CompositorBridgeChild.cpp:539
2 XUL NS_IsMainThread xpcom/threads/nsThreadManager.cpp:47
3  @0x11eddf447 
4 XUL mozilla::layers::PCompositorBridgeChild::OnMessageReceived ipc/ipdl/PCompositorBridgeChild.cpp:1436
5 XUL mozilla::ipc::MessageChannel::DispatchAsyncMessage ipc/glue/MessageChannel.cpp:2248
6 XUL mozilla::ipc::MessageChannel::DispatchMessage ipc/glue/MessageChannel.cpp:2175
7 XUL mozilla::ipc::MessageChannel::MessageTask::Run ipc/glue/MessageChannel.cpp:2045
8 XUL nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1161
9 libsystem_pthread.dylib libsystem_pthread.dylib@0x1c6f 

=============================================================
status-firefox-esr60: --- → affected
Jesse can you help find someone to investigate this crash? It seems to be particularly bothersome to Mac users.
status-firefox63: affected → wontfix
status-firefox64: affected → fix-optional
status-firefox65: --- → ?
Flags: needinfo?(jbonisteel)
Since this seems to have been introduced in 63.0.3 was bug 1503424 seems like the most likely cause.
Blocks: 1503424

Updated

2 months ago
Assignee: nobody → mstange
Flags: needinfo?(jbonisteel)

Updated

2 months ago
Flags: needinfo?(mstange)
(Assignee)

Comment 3

2 months ago
I can't make heads or tails of this. Bug 1503424 added a null check for something that we thought was only null on macOS 10.10 prerelease versions. So if there had been any regressions from it, I would have expected those regressions to only happen on machines with those 10.10 prereleases, and only for users that would otherwise have crashed on startup. But that's clearly not the case - these new crashes happen on seemingly all versions of macOS.

Here's the version breakdown:
OS X 10.9   93
OS X 10.10  176
OS X 10.11  67
OS X 10.12  137
OS X 10.13  71
OS X 10.14  69

There's a disproportionately large number of crashes on 10.10. So there clearly is *some* correlation with 10.10 but I don't understand what it could be.

As for the actual crash in this bug, it's here:

>  if (aTransactionId.IsValid()) {
>    nsIWidgetListener *listener = mWidget->GetWidgetListener();
>    if (listener) {
>      listener->DidCompositeWindow(aTransactionId, aCompositeStart, aCompositeEnd);
>    }
>    listener = mWidget->GetAttachedWidgetListener();

mWidget seems to be null. But it seems to not have been null during the call to mWidget->GetWidgetListener() a few lines further up, otherwise we should have crashed there. So maybe the call to listener->DidCompositeWindow is causing mWidget to be nulled out? But if that's the case, how is that related to the null check in the initialization method of ChildView that bug 1503424 added?
Flags: needinfo?(mstange)
(Assignee)

Comment 4

2 months ago
The code quoted above got compiled to the following (I'm showing Hopper's decompiled code for it):

>  if (r14 != 0x0) {
>    rdi = *(r13 + 0xb8);
>    rax = *rdi;
>    rax = (*(rax + 0x58))(rdi);
>    if (rax != 0x0) {
>      (*(*rax + 0x90))(rax, r14, r12, r15);
>    }
>    rdi = *(r13 + 0xb8);
>    rax = *rdi;            // <-- crash happens here
>    ...

r15 is the transaction ID, r13 is the this pointer to the ClientLayerManager instance, rdi is the pointer to the widget, rax is the pointer to the widget's vtable (and briefly the pointer to the widget listener). So this confirms the suspicion that the call to listener->DidCompositeWindow nulls out mWidget; the second time we try to get the widget's vtable, we dereference a null widget pointer.
(Assignee)

Comment 5

2 months ago
listener->DidCompositeWindow calls nsView::DidCompositeWindow, which puts a script blocker on the stack and then calls rootContext->NotifyDidPaintForSubtree, which puts DelayedFireDOMPaintEvents into the script blocker. So then those can execute JS when the script blocker is popped off the stack at the end of nsView::DidCompositeWindow. And if the JS handlers close windows, I assume this would destroy the widget and null it out on the ClientLayerManager.

I still can't see a relation to 1503424 though, and I don't see anything in the 63.0.1 -> 63.0.3 diff that indicates that it might change when windows are closed or how MozAfterPaint events are processed.
During triage we discussed the fact that there were a set of 52 crashes in 62.0.3: https://bit.ly/2E4gUdr. However, those crashes don't seem to be the same kind of URLs that are in the ones in 63, and they are not skewed to 10.10.
(Assignee)

Comment 7

2 months ago
Created attachment 9028453 [details]
Bug 1510058 - Work around mWidget becoming null during DidComposite, and add a diagnostic assert that will hopefully root out the culprit. r?jrmuizel

Comment 8

2 months ago
Pushed by mstange@themasta.com:
https://hg.mozilla.org/integration/autoland/rev/daf1ffb72989
Work around mWidget becoming null during DidComposite, and add a diagnostic assert that will hopefully root out the culprit. r=jrmuizel

Comment 9

2 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/daf1ffb72989
Status: NEW → RESOLVED
Last Resolved: 2 months ago
status-firefox65: ? → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Crash volume for 63.0.3 looks high enough that we may want to keep this on the radar for possible Fx64 dot release ride-along status.
status-firefox-esr60: affected → unaffected
(Assignee)

Comment 11

a month ago
Comment on attachment 9028453 [details]
Bug 1510058 - Work around mWidget becoming null during DidComposite, and add a diagnostic assert that will hopefully root out the culprit. r?jrmuizel

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: None

User impact if declined: Crashes on macOS, possibly when popup windows close. No steps to reproduce unfortunately, and no clearly identified regressor.

Is this code covered by automated tests?: No

Has the fix been verified in Nightly?: No

Needs manual test from QE?: No

If yes, steps to reproduce: 

List of other uplifts needed: None

Risk to taking this patch: Low

Why is the change risky/not risky? (and alternatives if risky): The patch just adds an additional null check.

String changes made/needed:
Attachment #9028453 - Flags: approval-mozilla-release?
Attachment #9028453 - Flags: approval-mozilla-beta?
Comment on attachment 9028453 [details]
Bug 1510058 - Work around mWidget becoming null during DidComposite, and add a diagnostic assert that will hopefully root out the culprit. r?jrmuizel

This is already on Beta (65).
Attachment #9028453 - Flags: approval-mozilla-beta?

Comment on attachment 9028453 [details]
Bug 1510058 - Work around mWidget becoming null during DidComposite, and add a diagnostic assert that will hopefully root out the culprit. r?jrmuizel

avoiding a null ptr crash on macos; approved for 64.0.2

Attachment #9028453 - Flags: approval-mozilla-release? → approval-mozilla-release+

Comment 14

15 days ago
bugherderuplift
status-firefox64: fix-optional → fixed

Comment 15

4 days ago

After 64.0.2 first days crash problem resolved but today affected again with MacOs Mojave 10.14.2

(Assignee)

Comment 16

5 hours ago

(In reply to YavuzGm from comment #15)

After 64.0.2 first days crash problem resolved but today affected again with MacOs Mojave 10.14.2

What's the crash ID of your crash? I think it might have been bug 1515183 and not this bug.

Looking at the remaining crashes in 64.0.2, there's only been 4 in ClientLayerManager::DidComposite, and they look very different and are likely a separate bug. So I think this fix was successful!

https://crash-stats.mozilla.com/search/?build_id=20190108160530&release_channel=release&signature=%3Dmozilla%3A%3Alayers%3A%3AClientLayerManager%3A%3ADidComposite&product=Firefox&version=64.0.2&date=%3E%3D2018-10-29T20%3A00%3A00.000Z&date=%3C2019-01-23T13%3A12%3A46.000Z&_sort=-date&_facets=install_time&_facets=version&_facets=address&_facets=moz_crash_reason&_facets=reason&_facets=build_id&_facets=platform_pretty_version&_facets=signature&_facets=useragent_locale&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports

You need to log in before you can comment on or make changes to this bug.