ChildImpl::OpenProtocolOnMainThread crashes when attempting to send a message during shutdown

RESOLVED FIXED in Firefox 51

Status

()

Core
IPC
P2
critical
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: ting, Assigned: cyu)

Tracking

({crash, topcrash-mac})

Trunk
mozilla52
Unspecified
Mac OS X
crash, topcrash-mac
Points:
---

Firefox Tracking Flags

(e10s+, firefox48 wontfix, firefox49 wontfix, firefox-esr45 wontfix, firefox50 wontfix, firefox51 fixed, firefox52 fixed)

Details

(Whiteboard: btpp-backlog, crash signature)

MozReview Requests

()

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(3 attachments, 1 obsolete attachment)

(Reporter)

Description

2 years ago
This bug was filed from the Socorro interface and is 
report bp-c35a89a5-deb8-4c1e-be49-a46202160606.
=============================================================

#2 of Mac OS X Nightly 20160605030215, 4 crashes from 2 installations. It's a MOZ_CRASH from opening PBackground failure.

Comment 1

2 years ago
Is this a new regression? I don't think that 4 crashes would typically be enough to investigate a bug report, in case this is developers doing odd things.
Flags: needinfo?(janus926)
That's four crashes for a single build, on OSX, which is fairly high for OSX. I see 51 crashes in the last 7 days, mostly on 49.

These are all happening during shutdown, so it is likely more fallout from the "submit old crash reports" thing.

BackgroundChild does have a sShutdownHasStarted variable that tracks when shutdown has started, but it is waiting for NS_XPCOM_SHUTDOWN_THREADS_OBSERVER_ID, which I think is partway through XPCOM shutdown, so it isn't catching when we trigger this in the earlier part of XPCOM shutdown, when we are presumably still not allowed to send IPC messages. That would be easy enough to fix to listen for whatever earlier point when we can't send IPC messages any more... but when we hit that condition we also crash, so it wouldn't really improve things! Maybe the behavior could be changed to fail more gracefully.
Summary: Crash in (anonymous namespace)::ChildImpl::OpenProtocolOnMainThread → ChildImpl::OpenProtocolOnMainThread crashes when attempting to send a message during shutdown
tracking-e10s: --- → ?
Keywords: topcrash-mac
I don't know how much it matters, but in two of the crash reports I looked at, what is happening is that the child is receiving a destroy message, which causes it to trigger PageHide, which runs some script, which triggers what I think is a sync XHR that spins the event loop, which then tries to create this background child thing.
See Also: → bug 1277582
Ting-Yu, are you interested/able to investigate more here? At first glance this doesn't seem super-urgent.
Whiteboard: btpp-backlog
Bill has a patch in bug 1268559 that might help this.
Depends on: 1268559
(Reporter)

Comment 6

2 years ago
Let's see if bug 1268559 could help this.
Flags: needinfo?(janus926)

Updated

2 years ago
tracking-e10s: ? → +
Priority: -- → P2
Crash Signature: [@ (anonymous namespace)::ChildImpl::OpenProtocolOnMainThread] → [@ (anonymous namespace)::ChildImpl::OpenProtocolOnMainThread] [@ `anonymous namespace''::ChildImpl::OpenProtocolOnMainThread ]
See Also: → bug 1140115
Crash volume for signature '(anonymous namespace)::ChildImpl::OpenProtocolOnMainThread':
 - nightly (version 50): 35 crashes from 2016-06-06.
 - aurora  (version 49): 110 crashes from 2016-06-07.
 - beta    (version 48): 1 crash from 2016-06-06.
 - release (version 47): 0 crashes from 2016-05-31.
 - esr     (version 45): 0 crashes from 2016-04-07.

Crash volume on the last weeks:
            W. N-1  W. N-2  W. N-3  W. N-4  W. N-5  W. N-6  W. N-7
 - nightly       6       3       1       2       0       7       9
 - aurora       11       7       6       8      12      29      29
 - beta          0       0       1       0       0       0       0
 - release       0       0       0       0       0       0       0
 - esr           0       0       0       0       0       0       0

Affected platform: Mac OS X
status-firefox48: --- → affected
status-firefox49: --- → affected
Crash volume for signature '`anonymous namespace''::ChildImpl::OpenProtocolOnMainThread':
 - nightly (version 51): 147 crashes from 2016-08-01.
 - aurora  (version 50): 280 crashes from 2016-08-01.
 - beta    (version 49): 185 crashes from 2016-08-02.
 - release (version 48): 6 crashes from 2016-07-25.
 - esr     (version 45): 5 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
 - nightly      42      51      31
 - aurora      105      89      26
 - beta          5      85      78
 - release       3       0       1
 - esr           2       0       1

Affected platform: Windows

Crash rank on the last 7 days:
           Browser     Content   Plugin
 - nightly #13
 - aurora  #16
 - beta    #803
 - release #4554
 - esr     #4043
status-firefox51: --- → affected
status-firefox-esr45: --- → affected
bug 1277582 just landed, let's see if this will disappear too.
Hello Kan-Ru, I just checked the crash signature and I see crash reports from 09-12 aurora50 build. I don't think the fix from bug 1277582 address this signature. Will you be working on an alternate fix for it?
Flags: needinfo?(kchen)
I can investigate this.
Assignee: nobody → kchen
Flags: needinfo?(kchen)
Assignee: kchen → cyu
Crash volume for signature '`anonymous namespace''::ChildImpl::OpenProtocolOnMainThread':
 - nightly (version 52): 148 crashes from 2016-09-19.
 - aurora  (version 51): 56 crashes from 2016-09-19.
 - beta    (version 50): 2 crashes from 2016-09-20.
 - release (version 49): 10 crashes from 2016-09-05.
 - esr     (version 45): 21 crashes from 2016-06-01.

Crash volume on the last weeks (Week N is from 10-03 to 10-09):
            W. N-1  W. N-2
 - nightly      48     100
 - aurora       45      11
 - beta          2       0
 - release       4       6
 - esr           1       5

Affected platform: Windows

Crash rank on the last 7 days:
           Browser     Content   Plugin
 - nightly #11
 - aurora  #21
 - beta    #4616
 - release #6227
 - esr     #4483
status-firefox52: --- → affected
(Assignee)

Comment 14

2 years ago
Created attachment 8797581 [details] [diff] [review]
A test case that crashes the browser.

After applying the patch, just open and then close the browser. Then the content process crashes.
(Assignee)

Comment 15

2 years ago
Created attachment 8797584 [details] [diff] [review]
WIP

This patch fails ChildImpl::OpenProtocolOnMainThread() gently after PContentChild is closed.
It looks like CamerasChild can handle the failure in opening BackgroundChild.
The change in RuntimeService seems to work, but I am not sure if it's the right fix.
Attachment #8797584 - Flags: review?(wmccloskey)
Comment on attachment 8797584 [details] [diff] [review]
WIP

Review of attachment 8797584 [details] [diff] [review]:
-----------------------------------------------------------------

If we decide to make the worker change, Blake should probably review that. I know very little about workers. Otherwise this looks good to me.

::: dom/workers/RuntimeService.cpp
@@ +2547,5 @@
>  
>    // Too many idle threads, just shut this one down.
>    if (shutdownThread) {
>      MOZ_ALWAYS_SUCCEEDS(aThread->Shutdown());
>    } else if (scheduleTimer) {

Maybe combine the mIdleThreadTimer check with this one?

@@ -2787,5 @@
>    //       mWorkerPrivate->SetThread() in order to avoid accidentally consuming
>    //       worker messages here.
>    if (NS_WARN_IF(!BackgroundChild::SynchronouslyCreateForCurrentThread())) {
>      // XXX need to fire an error at parent.
> -    return NS_ERROR_UNEXPECTED;

I guess I don't see why we need to change this. If the worker fails to start up when we're shutting down, would that be bad? It seems pretty reasonable to me.

::: ipc/glue/BackgroundImpl.cpp
@@ +2049,5 @@
>  
>      return true;
>    }
>  
> +  if (sIPCShutdown) {

Can you just use
  if (XRE_IsContentProcess() && ContentChild::GetSingleton()->IsShuttingDown())
?

Then you could remove the observer.
Attachment #8797584 - Flags: review?(wmccloskey)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
(Assignee)

Updated

2 years ago
Attachment #8797584 - Attachment is obsolete: true
(Assignee)

Comment 19

2 years ago
(In reply to Bill McCloskey (:billm) from comment #16)
> Comment on attachment 8797584 [details] [diff] [review]
> WIP
> 
> Review of attachment 8797584 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> If we decide to make the worker change, Blake should probably review that. I
> know very little about workers. Otherwise this looks good to me.
> 
> ::: dom/workers/RuntimeService.cpp
> @@ +2547,5 @@
> >  
> >    // Too many idle threads, just shut this one down.
> >    if (shutdownThread) {
> >      MOZ_ALWAYS_SUCCEEDS(aThread->Shutdown());
> >    } else if (scheduleTimer) {
> 
> Maybe combine the mIdleThreadTimer check with this one?
> 
This is related to the problem in test case design. Starting a worker in "xpcom-shutdow" phase confuses RuntimeService's cleanup sequence because RuntimeService also observes "xpcom-shutdow" to Shutdown() itself.

> @@ -2787,5 @@
> >    //       mWorkerPrivate->SetThread() in order to avoid accidentally consuming
> >    //       worker messages here.
> >    if (NS_WARN_IF(!BackgroundChild::SynchronouslyCreateForCurrentThread())) {
> >      // XXX need to fire an error at parent.
> > -    return NS_ERROR_UNEXPECTED;
> 
> I guess I don't see why we need to change this. If the worker fails to start
> up when we're shutting down, would that be bad? It seems pretty reasonable
> to me.
> 
It'd be pretty bad because the worker will not be unregistered. Then RuntimeService will wait forever for this leaked worker during shutdown.
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #19)
> (In reply to Bill McCloskey (:billm) from comment #16)
> > Comment on attachment 8797584 [details] [diff] [review]
> > WIP
> > 
> > Review of attachment 8797584 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > If we decide to make the worker change, Blake should probably review that. I
> > know very little about workers. Otherwise this looks good to me.
> > 
> > ::: dom/workers/RuntimeService.cpp
> > @@ +2547,5 @@
> > >  
> > >    // Too many idle threads, just shut this one down.
> > >    if (shutdownThread) {
> > >      MOZ_ALWAYS_SUCCEEDS(aThread->Shutdown());
> > >    } else if (scheduleTimer) {
> > 
> > Maybe combine the mIdleThreadTimer check with this one?
> > 
> This is related to the problem in test case design. Starting a worker in
> "xpcom-shutdow" phase confuses RuntimeService's cleanup sequence because
> RuntimeService also observes "xpcom-shutdow" to Shutdown() itself.

I just meant it would be easier to write this:
  } else if (scheduleTimer && mIdleThreadTimer) {
rather than:
  } else if (scheduleTimer) {
    if (mIdleThreadTimer) {

> > @@ -2787,5 @@
> > >    //       mWorkerPrivate->SetThread() in order to avoid accidentally consuming
> > >    //       worker messages here.
> > >    if (NS_WARN_IF(!BackgroundChild::SynchronouslyCreateForCurrentThread())) {
> > >      // XXX need to fire an error at parent.
> > > -    return NS_ERROR_UNEXPECTED;
> > 
> > I guess I don't see why we need to change this. If the worker fails to start
> > up when we're shutting down, would that be bad? It seems pretty reasonable
> > to me.
> > 
> It'd be pretty bad because the worker will not be unregistered. Then
> RuntimeService will wait forever for this leaked worker during shutdown.

OK, I was afraid that might be the case. I wonder how hard it would be to actually notify the parent of the error.
Comment on attachment 8797962 [details]
Bug 1278443 - Part 1: Don't open PBackground actors after PContentChild is closed.

https://reviewboard.mozilla.org/r/83580/#review83282
Attachment #8797962 - Flags: review?(wmccloskey) → review+
Hey Cervantes, sorry for my delay in reviewing this. I'll make sure to finish this review either later today or tomorrow.

Comment 23

2 years ago
mozreview-review
Comment on attachment 8797963 [details]
Bug 1278443 - Part 2: Continue to run the worker even if BackgroundChild fails to create.

https://reviewboard.mozilla.org/r/83582/#review84330

I'm not sure I fully understand all of the implications of this, but let's try it.
Attachment #8797963 - Flags: review?(mrbkap) → review+
Flags: needinfo?(cyu)

Comment 26

2 years ago
Pushed by cyu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/53a144cdc973
Part 1: Don't open PBackground actors after PContentChild is closed. r=billm
https://hg.mozilla.org/integration/autoland/rev/d4858e65d890
Part 2: Continue to run the worker even if BackgroundChild fails to create. r=mrbkap

Comment 27

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/53a144cdc973
https://hg.mozilla.org/mozilla-central/rev/d4858e65d890
Status: NEW → RESOLVED
Last Resolved: 2 years ago
status-firefox52: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Is this worth trying to backport?
status-firefox48: affected → wontfix
status-firefox49: affected → wontfix
status-firefox-esr45: affected → wontfix
Flags: needinfo?(cyu)
Keywords: intermittent-failure
Hi Kan-Ru,
Since :cyu is OOO, this might need your help to create uplift request. Thanks.
Flags: needinfo?(cyu) → needinfo?(kchen)
Comment on attachment 8797962 [details]
Bug 1278443 - Part 1: Don't open PBackground actors after PContentChild is closed.

Approval Request Comment
[Feature/regressing bug #]:
[User impact if declined]: Fixed a shutdown crash
[Describe test coverage new/current, TreeHerder]: on m-c. Baked on Nightly.
[Risks and why]: Low. This patch disallows PBackground creation when shutdown; PBackground client should already handle connection failure gracefully.
[String/UUID change made/needed]:
Flags: needinfo?(kchen)
Attachment #8797962 - Flags: approval-mozilla-aurora?
Comment on attachment 8797962 [details]
Bug 1278443 - Part 1: Don't open PBackground actors after PContentChild is closed.

Fix a crash. Take it in 51 aurora.
Attachment #8797962 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
kanru, cervantes: does part 2 also need uplifting ?
Flags: needinfo?(cyu)
Bug 925645 seems a lot more likely.
status-firefox50: affected → wontfix
Flags: needinfo?(cyu)
You need to log in before you can comment on or make changes to this bug.