Closed Bug 1587794 Opened 5 years ago Closed 5 years ago

Crash in [@ std::_Func_impl_no_alloc<T>::_Do_call]

Categories

(Core :: DOM: Service Workers, defect, P1)

Desktop
All
defect

Tracking


VERIFIED FIXED
mozilla71
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- unaffected
firefox67 --- unaffected
firefox68 --- unaffected
firefox69 --- unaffected
firefox70 --- unaffected
firefox71 blocking verified

People

(Reporter: pascalc, Assigned: asuth)

References

(Regression)

Details

(Keywords: crash, regression, topcrash, Whiteboard: [rca - Coding Error])

Crash Data

Attachments

(2 files)

This bug is for crash report bp-045e7748-7d88-4c82-b52b-69f290191009.

Top 10 frames of crashing thread:

0 xul.dll void std::_Func_impl_no_alloc<`lambda at z:/task_1570615541/build/src/dom/serviceworkers/ServiceWorkerRegistration.cpp:241:7', void, const mozilla::dom::ServiceWorkerRegistrationDescriptor&>::_Do_call 
1 xul.dll void std::_Func_impl_no_alloc<`lambda at z:/task_1570615541/build/src/dom/serviceworkers/RemoteServiceWorkerRegistrationImpl.cpp:60:7', void, mozilla::dom::IPCServiceWorkerRegistrationDescriptorOrCopyableErrorResult&&>::_Do_call 
2 xul.dll mozilla::dom::PServiceWorkerRegistrationChild::OnMessageReceived ipc/ipdl/PServiceWorkerRegistrationChild.cpp:283
3 xul.dll mozilla::ipc::PBackgroundChild::OnMessageReceived ipc/ipdl/PBackgroundChild.cpp:5876
4 xul.dll mozilla::ipc::MessageChannel::DispatchMessage ipc/glue/MessageChannel.cpp:2109
5 xul.dll mozilla::ipc::MessageChannel::RunMessage ipc/glue/MessageChannel.cpp:1954
6 xul.dll mozilla::ipc::MessageChannel::MessageTask::Run ipc/glue/MessageChannel.cpp:1985
7 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1225
8 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:486
9 xul.dll mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:110

This signature has been spiking since build 20191009103354. Setting this as a blocker for 71 with priority P1.

Moving to DOM:Service Workers.

Component: DOM: Animation → DOM: Service Workers
Priority: -- → P1

Some correlations:

(64.29% in signature vs 10.59% overall) Module "imagehlp.dll" = true [83.33% vs 18.97% if platform_version = 10.0.18362]
(42.86% in signature vs 00.02% overall) moz_crash_reason = MOZ_DIAGNOSTIC_ASSERT(bc->GetOpenerId() == parentBC)
(42.86% in signature vs 02.89% overall) Module "winsta.dll" = true
(35.71% in signature vs 00.02% overall) moz_crash_reason = MOZ_DIAGNOSTIC_ASSERT(target->GetInFlightProcessId() == inFlightProcessId)

This crash was averaging 200-300 crashes per Nightly build, which is very high, and the total has passed 2600 crashes. It has tapered off over the last few days. Since the merge is next week, can someone please take a look and try to figure out what is going on?

In terms of URLs, based on install time it looks as if users were crashing consistently on these sites:

Flags: needinfo?(jstutte)

I suspect this might also be related to bug 1456995, since several other service worker bugs appeared after that landed.

OS: Windows 7 → Windows

Adding perry since he worked on the bug mentioned in comment 3. This continues to average around 250 crashes per Nightly build.

Flags: needinfo?(perry)

Andrew, could you help us find somebody to fix our recent #1 top crasher? Thanks!

Flags: needinfo?(overholt)

This tab crash is reproducible almost instantly for me in a new profile on Windows when visiting https://www.sammobile.com/ and accepting their privacy terms (the popup may only be shown in the EU).

Running mozregression confirms the hunch from comment 3 that this regressed with bug 1456995:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=720c1e5a8dd3f5bda4ae32137d1c624a1ad55301&tochange=be9a6289486a6f366e431782b84a0c0633f8fec2
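mozregression narrows a regression window by bisecting between a known-good and a known-bad build. As a rough illustration of that idea only (this is not mozregression's code; the function name FirstBadBuild and the use of plain integers as build IDs are hypothetical), a bisection over an ordered list of builds might look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the index of the first "bad" build.
// Assumes builds are ordered oldest to newest, builds.front() is known good,
// builds.back() is known bad, and every build after the first bad one is bad.
template <typename Build, typename Pred>
std::size_t FirstBadBuild(const std::vector<Build>& builds, Pred isBad) {
  assert(builds.size() >= 2 && "need a known-good and a known-bad build");
  std::size_t good = 0;                 // invariant: builds[good] is good
  std::size_t bad = builds.size() - 1;  // invariant: builds[bad] is bad
  while (bad - good > 1) {
    std::size_t mid = good + (bad - good) / 2;
    if (isBad(builds[mid])) {
      bad = mid;   // regression is at or before mid
    } else {
      good = mid;  // regression is after mid
    }
  }
  return bad;  // adjacent pair: builds[good] good, builds[bad] bad
}
```

Each step halves the window, so a range of N builds needs about log2(N) test runs; the final good/bad pair corresponds to a pushlog range like the one linked above.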

At this point, 98% of crashes come with MOZ_DIAGNOSTIC_ASSERT(global), which was originally added in bug 1456466 and/or bug 1454646.

Keywords: regression
Regressed by: 1456995
See Also: → 1456466, 1454646

This is consistent with the DOMEventTargetHelper having been disconnected from its owning global. It probably makes sense to turn the assertion into an early return, but I want to make sure the actor lifecycles are correct. The most likely explanation is a MozPromise adding an event loop turn.
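The early-return idea can be sketched outside Gecko. This is a hypothetical model, not the actual ServiceWorkerRegistration code: Global, Registration, OnUpdateResolved and EventLoop are invented names, std::weak_ptr stands in for the object's link to its owning global, and a toy task queue stands in for the extra event loop turn a MozPromise resolution can introduce. The point is that the global can be torn down between queuing the callback and running it, so the callback checks for it instead of asserting.

```cpp
#include <deque>
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a window or worker global.
struct Global {
  std::string name;
};

// Hypothetical stand-in for a DOMEventTargetHelper-like object. It only holds
// a weak reference to its owning global, which may go away at any time.
class Registration {
 public:
  explicit Registration(std::shared_ptr<Global> global) : mGlobal(global) {}

  // Called when the (simulated) update result arrives. Rather than asserting
  // that the global still exists, return early if it is gone.
  // Returns true only if the result reached a live global.
  bool OnUpdateResolved() {
    std::shared_ptr<Global> global = mGlobal.lock();
    if (!global) {
      return false;  // Disconnected: nothing to notify, and no crash.
    }
    mLastDeliveredTo = global->name;
    return true;
  }

  std::string mLastDeliveredTo;

 private:
  std::weak_ptr<Global> mGlobal;
};

// Toy event loop: resolution is queued as a task and runs on a later turn,
// not synchronously with the code that triggered it.
struct EventLoop {
  std::deque<std::function<void()>> tasks;

  void Dispatch(std::function<void()> task) { tasks.push_back(std::move(task)); }

  void RunAll() {
    while (!tasks.empty()) {
      std::function<void()> task = std::move(tasks.front());
      tasks.pop_front();
      task();
    }
  }
};
```

In use, dispatching the callback, dropping the last owner of the Global, and then running the loop makes OnUpdateResolved return false instead of tripping an assertion.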

Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Flags: needinfo?(perry)
Flags: needinfo?(overholt)
Flags: needinfo?(jstutte)

It appears the scenario is this:

(Perry also recognized this in the other signature variant of this bug https://bugzilla.mozilla.org/show_bug.cgi?id=1588149#c1.)

Crash Signature: [@ std::_Func_impl_no_alloc<T>::_Do_call] → [@ std::_Func_impl_no_alloc<T>::_Do_call] [@ std::_Function_handler<T>::_M_invoke]
Crash Signature: [@ std::_Func_impl_no_alloc<T>::_Do_call] [@ std::_Function_handler<T>::_M_invoke] → [@ std::_Func_impl_no_alloc<T>::_Do_call] [@ std::_Function_handler<T>::_M_invoke] [@ std::__1::__function::__func<T>::operator()]

This is effectively a reversion of the change made in
https://hg.mozilla.org/mozilla-central/rev/89c938649297#l1.39 when
DOMMozPromiseRequestHolder was introduced. I've tried to add some
comments to contextualize what's happening there and why it differs
from other similar callsites.

Longer term we might move to just deleting the underlying actor when
we are disconnected. Those actors were written assuming an
execution model where letting either end delete the actor would result
in intentional process crashes when a message was received for a
destroyed actor. That is no longer the case.
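To illustrate the difference described above, here is a toy routing table. It is hypothetical and greatly simplified compared to real IPDL: Channel, Register, Destroy and Deliver are invented names. Either side may destroy its actor at any time, and a late message for a destroyed actor is dropped rather than treated as a fatal error.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for an IPC channel's actor lookup.
class Channel {
 public:
  using Handler = std::function<void(const std::string&)>;

  void Register(int actorId, Handler handler) {
    mActors[actorId] = std::move(handler);
  }

  // Either end may destroy its actor at any time.
  void Destroy(int actorId) { mActors.erase(actorId); }

  // Under the older assumption, a message for an unknown actor was a fatal
  // protocol error. Here it is silently dropped instead.
  // Returns true only if the message reached a live actor.
  bool Deliver(int actorId, const std::string& message) {
    auto it = mActors.find(actorId);
    if (it == mActors.end()) {
      return false;  // Actor already destroyed: drop the message.
    }
    it->second(message);
    return true;
  }

 private:
  std::map<int, Handler> mActors;
};
```

Under this model, deleting the actor on disconnect would be safe: any in-flight reply simply has nowhere to go.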

Because this seems fairly Gecko-specific, I've put this test in our mozilla-specific
WPT directory rather than writing it as a mochitest.

If I run this test without the fix, the browser crashes. With the fix, it
passes.

I run the test via:
./mach wpt testing/web-platform/mozilla/tests/service-workers/update_completes_in_disconnected_global.https.html

Note that currently, because of https://bugzilla.mozilla.org/show_bug.cgi?id=1587463#c13,
I had to locally comment out the lsan_dir lines I refer to there to get the
test to run.

Depends on D49671

Pushed by bugmail@asutherland.org:
https://hg.mozilla.org/integration/autoland/rev/1ab263e4b763
Do not assume global exists. r=perry
https://hg.mozilla.org/integration/autoland/rev/9ed7fe36e723
Add a test that verifies we don't crash on a disconnected global. r=perry

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla71

This bug has a clear regressor in Fx71. Crash reports from older releases appear to be unrelated to this specific issue.

Hello! Reproduced the issue using the websites from comment 2 with Firefox 71.0a1 (20191010214019) on Windows 10 x64.
The issue is verified fixed using Firefox 71.0a1 (20191020214712) on Windows 10 x64, macOS 10.14 and Ubuntu 18.04. No tab crashes were encountered while randomly navigating the affected websites.

Status: RESOLVED → VERIFIED
OS: Windows → All
Hardware: x86 → Desktop

This bug has been identified as part of a pilot on determining root causes of blocking and dot release drivers.

It needs a root-cause set for it. Please see the list at https://docs.google.com/document/d/1FFEGsmoU8T0N8R9kk-MXWptOPtXXXRRIe4vQo3_HgMw/.

Add the root cause as a whiteboard tag in the form [rca - <cause> ] and remove the rca-needed keyword.

If you have questions, please contact :tmaity.

Keywords: rca-needed
Keywords: rca-needed
Whiteboard: [rca - Coding Error]
See Also: → 1667335
Has Regression Range: --- → yes
