Closed Bug 1277582 Opened 3 years ago Closed 3 years ago

PNeckoChild::SendPHttpChannelConstructor called during shutdown

Categories

(Core :: Networking: HTTP, defect, P1, critical)

48 Branch
All
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla51
Tracking Status
e10s + ---
firefox48 --- wontfix
firefox49 + wontfix
firefox-esr45 --- unaffected
firefox50 + fixed
firefox51 + fixed

People

(Reporter: marcia, Assigned: billm)

References

Details

(Keywords: crash, topcrash, Whiteboard: [necko-active])

Crash Data

Attachments

(3 files, 2 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-59626cfd-df24-4c42-8446-90a822160602.
=============================================================

Seen while looking at nightly crash stats. Signatures appear for both Windows and Mac. First appearance on Mac seems to be 5-14.
This might be a signature change for bug 1152087.
[Tracking Requested - why for this release]:
#1 browser side top crasher in 49
I looked at a few reports, and I noticed that almost all of them have a ShutdownProgress value, like profile-before-change or xpcom-shutdown. I'm not sure where that comes from (the child doesn't have profile-before change), but we might be trying to create a new HttpChannel after shutdown has started. The constructors seem to be mostly or all being called from HttpChannelChild::ContinueAsyncOpen.
Component: Networking → Networking: HTTP
Duplicate of this bug: 1152087
This has risen to nearly 5% of all crashes on Nightly.

note at time of this comment: How are we getting an installation count of 354 when there are only 222 crash reports on 49.0a1?
tracking-e10s: --- → ?
I think this spike is from the backwash of old crash reports from blassey's patch. Looking through the first page, most of them are from a week or more ago, which makes sense as this is some kind of shutdown crash. So I think it isn't a real top crash.

Also, why are we doing anything with PNeckoChild in the parent? I find that surprising.
If you click on "by build date" instead of "by report date" on the top crash list, this is only #10, which about 1% of crashes. So, serious, but not as bad as it looks.
Keywords: topcrash
Summary: Crash in mozalloc_abort | Abort | NS_DebugBreak | mozilla::ipc::FatalError | mozilla::net::PNeckoChild::SendPHttpChannelConstructor → Parent process calls PNeckoChild::SendPHttpChannelConstructor during shutdown
Flags: needinfo?(mcmanus)
(In reply to Andrew McCreight [:mccr8] from comment #6)
> Also, why are we doing anything with PNeckoChild in the parent? I find that
> surprising.

I guess the parent process is the one that sends a message to start up the child, or something.

Anyways, I'd guess what we need to do here is the usual IPC shutdown dance where something checks if we've shutdown somehow, and if so, don't send the message.
This has risen up to the #1 top browser crash in nightly. We're merging today so it's going to move to aurora as well.
Flags: needinfo?(jduell.mcbugs)
jason is the right person to look at this and he's looking right now..
Flags: needinfo?(mcmanus)
Valentin is taking this.  Dragana points out it may be related to bug 1277961
Assignee: nobody → valentin.gosu
Flags: needinfo?(jduell.mcbugs)
(In reply to blinky from comment #13)
> Regression range:
> https://hg.mozilla.org/integration/mozilla-inbound/
> pushloghtml?fromchange=02c69f4f896255189ce2f9f4e0d875e383bcfbd7&tochange=a1bd
> 47d76f71162534090485acc57866dcd55eef
> 
> Regressed by: bug 1229961

Thanks for testing and providing the video. Curious how you came up with this regression range?
Flags: needinfo?(over68)
There's a mixture of stacks in here. The plugin _geturlnotify during plugin shutdown calls that blinky experienced are part of this, but not 100% of the stacks.
(In reply to Jim Mathies [:jimm] from comment #14)
> (In reply to blinky from comment #13)
> > Regression range:
> > https://hg.mozilla.org/integration/mozilla-inbound/
> > pushloghtml?fromchange=02c69f4f896255189ce2f9f4e0d875e383bcfbd7&tochange=a1bd
> > 47d76f71162534090485acc57866dcd55eef
> > 
> > Regressed by: bug 1229961
> 
> Thanks for testing and providing the video. Curious how you came up with
> this regression range?

With setting the pref dom.ipc.plugins.asyncdrawing.enabled to false, I can not reproduce the crash.
Flags: needinfo?(over68)
Let's track this crash for 49.
Do you think you could try installing this build and letting us know if you can reproduce the bug?
http://archive.mozilla.org/pub/firefox/try-builds/valentin.gosu@gmail.com-05e6aad0172e900900dd2d0ea04df3798ddccb6f/try-win32/firefox-49.0a1.en-US.win32.installer.exe

Thanks!
Flags: needinfo?(over68)
(In reply to Valentin Gosu [:valentin] from comment #19)
> Do you think you could try installing this build and letting us know if you
> can reproduce the bug?
> http://archive.mozilla.org/pub/firefox/try-builds/valentin.gosu@gmail.com-
> 05e6aad0172e900900dd2d0ea04df3798ddccb6f/try-win32/firefox-49.0a1.en-US.
> win32.installer.exe
> 
> Thanks!

I can not reproduce the crash with this build, thanks.
Flags: needinfo?(over68)
MozReview-Commit-ID: F6pCCn4jPVb
Attachment #8760642 - Flags: review?(mcmanus)
Do you know why we are calling HttpChannelChild::Continue... in the parent process?
I don't think this is the parent process. There are plugin-container related things in the crash-report.
(In reply to Valentin Gosu [:valentin] from comment #23)
> I don't think this is the parent process. There are plugin-container related
> things in the crash-report.

Ok, only the title is wrong :)
Attachment #8760642 - Flags: review?(mcmanus)
Attachment #8760642 - Flags: review?(josh)
Attachment #8760642 - Flags: review?(dd.mozilla)
Attachment #8760642 - Flags: feedback+
Summary: Parent process calls PNeckoChild::SendPHttpChannelConstructor during shutdown → PNeckoChild::SendPHttpChannelConstructor called during shutdown
Attachment #8760642 - Flags: review?(dd.mozilla) → review+
The crash stack in comment 0 is the exact same problem as bug 1277961; it's not anything related to shutdown in particular, it's that the TabChild being used in the XHR is no longer valid. In this particular crash stack, the original stack is from TabChild::RecvDestroy - this would ordinarily be fine, since it enqueues a runnable that sends the TabChild::__delete__ message to the parent. However, the stack shows us starting a synchronous XHR via the "unload" event dispatched to the tab child global, which processes the runnable and deletes the tab child (since there's apparently a service worker involved that causes the XHR networking to be asynchronous). A more robust fix that would deal with bug 1277961 as well would be to check the state of the tab child object we want to use before using it.
Comment on attachment 8760642 [details] [diff] [review]
Don't send SendPHttpChannelConstructor ipdl messages during child shutdown

As explained in the previous comment, I think the fact that this works is merely coincidence.
Attachment #8760642 - Flags: review?(josh) → review-
(In reply to Josh Matthews [:jdm] from comment #26)
> Comment on attachment 8760642 [details] [diff] [review]
> Don't send SendPHttpChannelConstructor ipdl messages during child shutdown
> 
> As explained in the previous comment, I think the fact that this works is
> merely coincidence.

I don't think it's a mere coincidence, this is one of the codepaths that lead to us sending IPC messages after the shutdown has begun. Indeed, it seems that bug 1277961 is a generalization of this issue, and would not be fixed by this patch.
I'll have a chat with :jduell to see what are the options for fixing this.
Whiteboard: [necko-active]
Priority: -- → P1
So, I talked about this issue with :BillM and it seems that attachment 8759907 [details] [diff] [review] would fix this issue (and bug 1277961).
See Also: → 1268559
What seems to be a related shutdown crash involving the HTTP channel is in bug 1274886.
See Also: → 1274886
MozReview-Commit-ID: F6pCCn4jPVb
Attachment #8765914 - Flags: review?(josh)
Attachment #8765914 - Flags: review?(dd.mozilla)
Attachment #8760642 - Attachment is obsolete: true
The FatalError that's being invoked in that crash report is different:
```
    bool sendok__ = (mChannel)->Send(msg__);
    if ((!(sendok__))) {
        FatalError("constructor for actor failed");
        return nullptr;
    }
```
That's the same stack trace as the crash report reported in https://bugzilla.mozilla.org/show_bug.cgi?id=1277961#c5, too.
I'm not sure what to do about that; this patch shouldn't affect it at all, so we'll probably need input from someone like billm who knows the IPC layer well.
Flags: needinfo?(wmccloskey)
Is this related to the crash from bug 1277961? 
This is the #1 topcrash in aurora so I'm hoping once this new patch lands we can judge if it's effective and uplift it.
Flags: needinfo?(valentin.gosu)
The stack trace in bug 1277961 also contains SendPHttpChannelConstructor.
I fully expect that bug would be fixed by this patch.
Flags: needinfo?(valentin.gosu)
It looks like there are several issues here. The simplest one (which mainly seems to be covered by bug 1277961) is that the PContent link to the parent process has unexpectedly closed. I'm not sure why that happens--maybe the parent crashed. Any attempt to send at this point will fail. The best thing we can do is avoid sending. So I think Valentin's patch will work in this case. So I think we should proceed with this patch, but maybe it should be moved to bug 1277961.

However, there's a separate issue where we're shutting down normally. We should be allowed to send sync XHRs at this point since we delete the tab actor in a runnable. However, it looks like the XHR tries to send more IPC stuff in a separate turn of the event loop from the initial XHR, and it somehow has to do with service workers? This part I don't really understand. What does ResetInterception mean?

Anyway, the ideal solution for this problem is to avoid tearing down the tab actor until the XHR has completed. I think we could use a similar solution as in bug 1268559: the Send__delete__ runnable for TabChild would be put on hold if we're in a nested event loop.

However, it would be great if we had a test case for this issue. It's pretty easy to send an XHR from an "unload" or "pagehide" handler, but I don't know how to get to this ResetInterception situation. Josh, can you answer that or needinfo someone who can? It will be a lot easier to patch this with a test case.
Flags: needinfo?(wmccloskey) → needinfo?(josh)
(In reply to Bill McCloskey (:billm) from comment #39)
> However, there's a separate issue where we're shutting down normally. We
> should be allowed to send sync XHRs at this point since we delete the tab
> actor in a runnable. However, it looks like the XHR tries to send more IPC
> stuff in a separate turn of the event loop from the initial XHR, and it
> somehow has to do with service workers? This part I don't really understand.
> What does ResetInterception mean?
> 
> Anyway, the ideal solution for this problem is to avoid tearing down the tab
> actor until the XHR has completed. I think we could use a similar solution
> as in bug 1268559: the Send__delete__ runnable for TabChild would be put on
> hold if we're in a nested event loop.
> 
> However, it would be great if we had a test case for this issue. It's pretty
> easy to send an XHR from an "unload" or "pagehide" handler, but I don't know
> how to get to this ResetInterception situation. Josh, can you answer that or
> needinfo someone who can? It will be a lot easier to patch this with a test
> case.

With regards to the test case, the only difference is that a page with an unload/pagehide handler needs to register an empty service worker - this will cause the ResetInterception code path to be taken, which makes AsyncOpen become more asynchronous than usual.
Flags: needinfo?(josh)
Comment on attachment 8765914 [details] [diff] [review]
Don't send SendPHttpChannelConstructor ipdl messages during child shutdown

Review of attachment 8765914 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not convinced about how much this will help, but it certainly shouldn't hurt besides covering up cases where we should be sending valid actors but can't due to bugs.
Attachment #8765914 - Flags: review?(josh) → review+
Attachment #8765914 - Flags: review?(dd.mozilla) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/c465467931d3806df0a28d3351d6eb15351b561a
Bug 1277582 - Don't send SendPHttpChannelConstructor ipdl messages during child shutdown r=dragana,jdm
https://hg.mozilla.org/mozilla-central/rev/c465467931d3
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
This isn't fixed. We still need to figure out what to do during normal shutdown.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Duplicate of this bug: 1277961
It appears that the latest patch hasn't fixed the issue either:
https://crash-stats.mozilla.com/report/index/67dbee6c-acf9-44ed-92d2-ce61e2160709

This is quite confusing to me. I was expecting that tabChild->IsDestroyed() would be true in this case.
Bill, is there any other reason why ``` bool sendok__ = (mChannel)->Send(msg__); ``` might return false, other than shutdown?
Flags: needinfo?(wmccloskey)
This seems to be causing us to leak outer windows.  See the STR in bug 1286008.  After backing this patch out the leaks go away.

If this doesn't even fix the origin problem can we just back it out please?
Flags: needinfo?(valentin.gosu)
I suspect twitter leaks with this patch because it uses a service worker without a fetch event handler.  Its always hitting the ResetInterception() path which basically does nothing with an error code from ContinueAsyncOpen():

https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/HttpChannelChild.cpp#2531

Josh, it seems like maybe ResetInterception() should do something with that error code to abort the channel?
Flags: needinfo?(josh)
Actually, I have an easy fix for the bad error handling in bug 1286258.  I'll just fix that there.
Flags: needinfo?(josh)
This is #3 crash on windows nightly build 20160711034039 with 21 crashes.
As I said in comment 39, I don't expect the patch here to fix this bug. I had expected it to fix bug 1277961 (the patch really should have been moved there). However, it looks like it didn't fix that bug. I think the reason is that we're doing the wrong check. IsDestroyed() only checks if the tab has been destroyed gracefully. We should be using IPCOpen() instead.

Comment 39 describes a change that I think would actually fix this bug. Valentin, is that something you could work on?
Flags: needinfo?(wmccloskey)
Crash Signature: mozalloc_abort | NS_DebugBreak | mozilla::ipc::FatalError | mozilla::net::PNeckoChild::FatalError | mozilla::net::PNeckoChild::SendPHttpChannelConstructor] → mozalloc_abort | NS_DebugBreak | mozilla::ipc::FatalError | mozilla::net::PNeckoChild::FatalError | mozilla::net::PNeckoChild::SendPHttpChannelConstructor] [@ mozalloc_abort | NS_DebugBreak | mozilla::ipc::FatalError | mozilla::net::PNeckoChild::Write] …
Hi! 
I have many crashes on some .pdf files and my crash reports suggests this bug.
These crashes happens only with e10s  enabled in all versions: FF 47/ Beta / Nightly.
FF freeze and I should kill process from task manager (or kill Firefox plugin container wich remains at 30-35% CPU usage)
I tried everything: clean profile/clean installation/ safe mode..the issue remains.
OS: WIN 8.1 pro
Attachment #8771178 - Flags: review?(wmccloskey) → review+
(In reply to Bill McCloskey (:billm) from comment #51)
> Comment 39 describes a change that I think would actually fix this bug.
> Valentin, is that something you could work on?

I'm actually not very familiar with the workings of TabChild, which is probably why I didn't fix this properly.
If you could help me with that I would really appreciate it.
Flags: needinfo?(valentin.gosu)
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2c0a0814add0
Check if TabChild is still available before calling SendPHttpChannelConstructor. r=billm
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/2c0a0814add0
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Flags: needinfo?(wmccloskey)
Resolution: FIXED → ---
Bug 1277961 (the PNeckoChild::Write signature) appears to be fixed now!
I wasn't able to reproduce the crash, but I was able to reproduce the XHR failing when a service worker was registered. This patch fixes that problem. Hopefully it will solve the crashes too.

Blake, the basic problem is that there's a sync XHR and within the nested event loop we're tearing down the PBrowser. That causes the XHR to fail/crash. We already protect against tearing down PContent in this situation. This patch just protects PBrowser in the same (ugly) way. Eventually I would like to avoid running (at least some) IPC message handlers from within nested event loops, but that will require quite a bit more testing.
Flags: needinfo?(wmccloskey)
Attachment #8772627 - Flags: review?(mrbkap)
Attachment #8772627 - Flags: review?(mrbkap) → review+
Still seeing the crashes. Bill, is the patch ready to land?
Flags: needinfo?(wmccloskey)
Not quite. There were some test failures that I need to figure out.
Flags: needinfo?(wmccloskey)
Assignee: valentin.gosu → wmccloskey
Marking this affected for 50 since the bug was reopened.   Once we have a verified fix, we may want to uplift it as far as beta since it looks like a top crash in beta 1.
OS: All → Windows
Version: Trunk → 49 Branch
Version: 49 Branch → 48 Branch
Tracking 51+ for this crash which we can check on nightly volume once the fix lands.
Something else must have mitigated this crash for 49, maybe something Dragana worked on. It isn't a topcrash in beta anymore,  but still looks significant volume in 51 and 50.
(In reply to Bill McCloskey (:billm) from comment #60)
> Not quite. There were some test failures that I need to figure out.

Bill: Any progress yet? This remains #4 top browser crash on nightly.
Flags: needinfo?(wmccloskey)
I spent quite a bit of time on this today. It turns out my original patch wouldn't have fixed the problem. I'm not sure why I thought it would.

However, I've been able to reproduce a crash that looks identical to the ones we're seeing in the reports. Here's what I think is happening:

1. The user quits while there is a page open that has a service worker registered. The page has initiated an XHR at some point (maybe in an onunload handler, but could be anywhere).
2. We go through the shutdown sequence, sending the Destroy message to every tab and then the Shutdown message to the ContentChild.
3. The parent calls the Close() method on the PContent channel.
4. The child receives the goodbye message on the IPC thread and posts a runnable on the main thread to call ActorDestroy on the ContentChild. It immediately sets the channel state to closed.
5. At this point, the XHR rises up and decides it wants to ContinueAsyncOpen. Even though the tab it's associated with is dead, it doesn't notice this because the tabChild we get from GetCallbacks is null anyway. I'm not sure why this is. When it sends the PHttpChannelConstructor message, Connected() is false for the channel and we assert.
6. We write out a minidump, but the parent never intercepts it because the ContentParent for that process is already gone. That's why the ProcessType is being set incorrectly for these crashes (bug 1299964).

I'll prepare a patch for this ASAP. We can patch around this, but it exposes some fundamental issues in how IPDL deals with failure.

Since this is a shutdown crash that only gets reported through the infobar, it's invisible to release users. So it's not a major priority.
Flags: needinfo?(wmccloskey)
Please see the previous comment for an explanation of what's going on. Basically, there's a short time interval between when the IPC thread finds out that the channel is closed and when the main thread finds out that the channel is closed where we'll crash if we send any constructor messages. It seems like networking is where this tends to happen, but it could happen anywhere.

This patch fixes the problem for networking. The parent won't Close() the channel until the child sends the FinishShutdown message.

We really need to fix this more generally though. I'll file something for that.
Attachment #8772627 - Attachment is obsolete: true
Attachment #8787798 - Flags: review?(mrbkap)
Comment on attachment 8787798 [details] [diff] [review]
don't send HTTP constructor when shutting down

Review of attachment 8787798 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with my question addressed.

::: netwerk/protocol/http/HttpChannelChild.cpp
@@ +1998,5 @@
>    AddIPDLReference();
>  
> +  ContentChild* cc = static_cast<ContentChild*>(gNeckoChild->Manager());
> +  if (cc->IsShuttingDown()) {
> +    return NS_ERROR_FAILURE;

Shouldn't this happen before the AddIPDLReference call? If we don't call SendPHttpChannelConstructor then we never actually transfer ownership. It isn't a big deal because we're shutting down anyway, but might be worth fixing.
Attachment #8787798 - Flags: review?(mrbkap) → review+
See Also: → 1282776
Pushed by wmccloskey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f7fc4fab8018
Don't send HTTP constructor when content process has started to shut down (r=mrbkap)
Good point. I moved the check up above the addref.
https://hg.mozilla.org/mozilla-central/rev/f7fc4fab8018
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
Hi Bill, should we consider uplifting this fix to Aurora50?
Flags: needinfo?(wmccloskey)
Sure. I would like to hold off uplifting to beta since this crash doesn't really affect release users, and we should make sure there are no regressions.
Flags: needinfo?(wmccloskey)
Comment on attachment 8787798 [details] [diff] [review]
don't send HTTP constructor when shutting down

This fixes a content process crash that is only visible to users when the unsubmitted crashes infobar pops up. It cannot lead to data loss as far as I know.

Approval Request Comment
[Feature/regressing bug #]: e10s I think
[User impact if declined]: more unsubmitted crash reports
[Describe test coverage new/current, TreeHerder]: on m-c for a little while
[Risks and why]: Simple patch, pretty low risk.
[String/UUID change made/needed]: None
Attachment #8787798 - Flags: approval-mozilla-aurora?
I'll wait for this to stabilize on Nightly51 for a day or two before approving the uplift to Aurora50.
Comment on attachment 8787798 [details] [diff] [review]
don't send HTTP constructor when shutting down

Crash fix, verified that the data from Nightly looks promising so far, Aurora50+
Attachment #8787798 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Target Milestone: mozilla50 → mozilla51
It looks like this worked. Last nightly crash was on 9/7.
Duplicate of this bug: 1299964
Duplicate of this bug: 1282776
Blocks: 1363659
You need to log in before you can comment on or make changes to this bug.