Closed Bug 1294719 Opened 8 years ago Closed 8 years ago

Crash in IPCError-content | (msgtype=0x820017,name=???) Route error: message sent to unknown actor ID

Categories

(Core :: Networking, defect)

48 Branch
Unspecified
All
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
e10s + ---
firefox48 --- wontfix
firefox49 + wontfix
firefox50 + fixed
firefox51 --- fixed
firefox52 --- fixed

People

(Reporter: jimm, Assigned: valentin)

References

Details

(Keywords: crash, Whiteboard: [necko-active])

Crash Data

Attachments

(3 files, 2 obsolete files)

proto: PHttpChannelMsgStart,
call: Msg_Redirect1Begin__ID,

This bug was filed from the Socorro interface and is 
report bp-2174424b-e436-45a2-92d6-ce0852160809.
=============================================================
Jason - can you find someone to look at this?
Flags: needinfo?(jduell.mcbugs)
Whiteboard: [necko-active]
Nick--how often is this crash happening?
Flags: needinfo?(jduell.mcbugs) → needinfo?(hurley)
Looking at the crash signature, there appear to be almost 600 of these submitted over the last week or so, about 75/day on average, mostly (but not exclusively) on Win 7. Once again, this is one of those things that (so far) doesn't show up on current aurora or nightly, but it has shown up on previous auroras (49.0a2), at least.
Flags: needinfo?(hurley) → needinfo?(jduell.mcbugs)
Bill: does this signature ring any bells for you? Is there any common error case where we see "Sent to unknown actor ID".  Would we see this if the child has exited/crashed?

Also I'm confused by the line in comment 0:

   proto: PHttpChannelMsgStart

We only have a PHttpChannel protocol.  Is the MsgStart thing an IPDL internal?
Flags: needinfo?(jduell.mcbugs) → needinfo?(wmccloskey)
Crash volume for signature 'IPCError-content | (msgtype=0x820017,name=???) Route error: message sent to unknown actor ID':
 - nightly (version 51): 0 crashes from 2016-08-01.
 - aurora  (version 50): 0 crashes from 2016-08-01.
 - beta    (version 49): 751 crashes from 2016-08-02.
 - release (version 48): 83 crashes from 2016-07-25.
 - esr     (version 45): 0 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
 - nightly       0       0       0
 - aurora        0       0       0
 - beta        264     296      68
 - release      39      11       3
 - esr           0       0       0

Affected platforms: Windows, Mac OS X, Linux

Crash rank on the last 7 days:
             Browser Content     Plugin
 - nightly
 - aurora
 - beta              #6
 - release           #6
 - esr
This means that the parent is sending this message to the child:
http://searchfox.org/mozilla-central/rev/f80822840bc5f3d2d3cae3ece621ddbce72e7f54/netwerk/protocol/http/HttpChannelParent.cpp#1291

However, the child has already called Send__delete__() on its copy of the same channel. Note that the parent might not have received ActorDestroy at this point. This is a pretty annoying aspect of IPDL. If you call Send__delete__() on some actor, you somehow need to know that it's not going to try to send any messages after that point (usually by sending a message asking it not to).
Flags: needinfo?(wmccloskey)
Assignee: nobody → valentin.gosu
[Tracking Requested - why for this release]:
this crash is now strongly on the rise since 2016-09-15 and since then it's making up about 10% of all crashes in the content process on 48.0.2.
this spike is mostly coming from "ru" locale builds and many user comments seem to indicate that their tab crashed when they tried accessing their webmail (i'm guessing it's about the "@mail.ru" provider).

another repeat theme in user comments seems to twitter and clicking on a share button.
tracking-e10s: --- → ?
(100.0% in signature vs 24.64% overall) dom_ipc_enabled = 1
(80.31% in signature vs 09.48% overall) useragent_locale = ru
(98.27% in signature vs 41.32% overall) reason = EXCEPTION_BREAKPOINT
(71.17% in signature vs 44.98% overall) platform_pretty_version = Windows 7
(62.63% in signature vs 38.21% overall) platform_version = 6.1.7601 Service Pack 1

The most occuring domain is ok.ru, but most crashes have an empty URL field.
Relman has found this crash is spiking over the last few days - (8.5% of all content crashes on 48.0.2). Can this ticket be escalated to the top of your list?
Status: NEW → ASSIGNED
Flags: needinfo?(valentin.gosu)
It is currently my top priority and I'm working on a fix.
Flags: needinfo?(valentin.gosu)
Thanks Valentin. A fix that we would ship in a dot release sounds like a good idea.  

Maybe in the meantime we can disable e10s for russian locales.  I don't like to think of people crashing while using a popular mail client, if we can prevent that, ship a fix fairly soon, we could then turn e10s back on right after a dot release. Brad, Felipe, what do you think? Can we do that with an update to the system addon?
Flags: needinfo?(felipc)
Flags: needinfo?(blassey.bugs)
n-i to jimm and billm since brad is traveling.
Flags: needinfo?(wmccloskey)
Flags: needinfo?(jmathies)
Flags: needinfo?(blassey.bugs)
Answering the question of whether we could: yeah

There was some conversations in #e10s about whether it's worth doing this or just wait for the problem to be fixed on the dot release.
I'll be paying attention on what's decided.
Flags: needinfo?(felipc)
Felipe knows much more than me about how to do this. Fixing this could take a couple more days at least, so if we can disable for Russian locales more quickly than that, it might be worth it.
Flags: needinfo?(wmccloskey)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #11)
> Thanks Valentin. A fix that we would ship in a dot release sounds like a
> good idea.  
> 
> Maybe in the meantime we can disable e10s for russian locales.  I don't like
> to think of people crashing while using a popular mail client, if we can
> prevent that, ship a fix fairly soon, we could then turn e10s back on right
> after a dot release. Brad, Felipe, what do you think? Can we do that with an
> update to the system addon?

Depends on the timing - If this bug is going to remain out there causing issues for a week or longer I'd say go ahead and disable e10s for the locale through the system add-on. If we plan to get a fix in within the next few days, we can wait.
Flags: needinfo?(jmathies)
Probably safe to assume the fix will need some bake time too. Valentin, what's your thinking? I'm leaning toward disabling for .ru temporarily so you have the time to get this reviewed, landed, and uplifted.
Flags: needinfo?(valentin.gosu)
I think it's fair to disable it for the locale. I don't think it would be wise to uplift the patch without letting it sit for a few days in m-c.
Flags: needinfo?(valentin.gosu)
Valentin, when you get this landed you should take the patch from bug 1304874 and land it together.
Has anyone been able to reproduce the issue? If so, can you put some STR in to the bug?

Andrei, maybe your team can dig into trying to reproduce this (or maybe you already have).  Thanks!
Flags: needinfo?(andrei.vaida)
Tracking for 49 and 50 since we know they are affected.
Have we seen a reduction in this crash with the deployment of bug 1304164?
Flags: needinfo?(mcastelluccio)
We've definitely seen a reduction (from ~1400 on Sep 22 to ~700 on Sep 24). It's been
steady since Sep 24 (between ~650 to ~850).

https://crash-stats.mozilla.com/signature/?product=Firefox&signature=IPCError-content%20%7C%20%28msgtype%3D0x820017%2Cname%3D%3F%3F%3F%29%20Route%20error%3A%20message%20sent%20to%20unknown%20actor%20ID&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_sort=-date&page=1#graphs

The remaining crashes are still correlated with the Russian locale (less than before,
probably because some of them now have version 1.3 of the addon) and with e10s enabled:
https://crash-stats.mozilla.com/search/?signature=%3DIPCError-content%20%7C%20%28msgtype%3D0x820017%2Cname%3D%3F%3F%3F%29%20Route%20error%3A%20message%20sent%20to%20unknown%20actor%20ID&date=%3E%3D2016-09-24&product=Firefox&version=49.0.1&version=49.0&_sort=-date&_facets=signature&_facets=useragent_locale&_facets=addons&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-useragent_locale.

Many users still have version 1.2 of e10srollout@mozilla.org, some have version 1.3:
  3206 crash reports have been analyzed and the following versions have been found:
   - e10srollout@mozilla.org
     - 1.2: 2733
     - 1.3: 449

Tomorrow I will check how many users are experiencing this crash with the Russian locale
and the addon at version 1.3 (hopefully ~0).

I would expect the number of crash reports to reduce as more people are updated to 1.3, I'm
not sure why it hasn't happened yet between Sep 24 and now.
(In reply to Marco Castelluccio [:marco] from comment #24)
> Tomorrow I will check how many users are experiencing this crash with the
> Russian locale
> and the addon at version 1.3 (hopefully ~0).

Yes, just one report with locale=RU has version 1.3 of the addon (perhaps
they forcibly enabled e10s), all the other reports with locale=RU (2357)
have version 1.2 of the addon.
Flags: needinfo?(mcastelluccio)
Valentin, curious how the fix is coming along?
Flags: needinfo?(valentin.gosu)
I've hit a roadblock implementing the fix because of a quirk in how we do service worker redirect intercepting. This is causing regressions in a couple of tests.
:jdm has helped me debug the problem, but I'm still working on a fix.
Flags: needinfo?(valentin.gosu)
Crash Signature: [@ IPCError-content | (msgtype=0x820017,name=???) Route error: message sent to unknown actor ID] → [@ IPCError-content | (msgtype=0x820017,name=???) Route error: message sent to unknown actor ID] [@ IPCError-content | (msgtype=0x900017,name=???) Route error: message sent to unknown actor ID]
I have tried reproducing this issue on Windows 10 x64 and Windows 7 x64 using:
- Beta 49.0b1
- Beta 49.0b6
- RC 48
- RC 49.0.2
all RU locale builds, e10s on and off but couldn't reproduce the crash. 
I have accessed Mail.ru and OK.ru and other russian sites but all went smoothly. Please let me know when there is more information on the issue and I can give it another try.
Flags: needinfo?(andrei.vaida)
Crash Signature: [@ IPCError-content | (msgtype=0x820017,name=???) Route error: message sent to unknown actor ID] [@ IPCError-content | (msgtype=0x900017,name=???) Route error: message sent to unknown actor ID] → [@ IPCError-content | (msgtype=0x820017,name=???) Route error: message sent to unknown actor ID] [@ IPCError-content | (msgtype=0x900017,name=???) Route error: message sent to unknown actor ID] [@ IPCError-content | (msgtype=0x8A0017,name=???) Route err…
Comment on attachment 8799255 [details] [diff] [review]
Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__()

Review of attachment 8799255 [details] [diff] [review]:
-----------------------------------------------------------------

I don't love it, but I don't have better ideas. I'm pretty sure all of this complexity will go away with bug 1231222. I'd like to see another revision with my comments addressed.

::: netwerk/protocol/http/HttpChannelChild.cpp
@@ +212,1 @@
>      return 0;

This isn't right any more - DeallocPHttpChannel isn't called any more, so the refcount is still 1 at this point.

@@ +240,5 @@
>    NS_INTERFACE_MAP_ENTRY(nsIDivertableChannel)
> +  // we have no macro that covers this case.
> +  if (aIID.Equals(NS_GET_IID(HttpChannelChild)) ) {
> +      AddRef();
> +      *aInstancePtr = this;

NS_ADDREF(*aInstancePtr = this);

@@ +1133,5 @@
>    }
>    // WARNING:  DO NOT RELY ON |THIS| EXISTING ANY MORE! 
>    // 
>    // NeckoChild::DeallocPHttpChannelChild() may have been called, which deletes 
>    // |this| if IPDL holds the last reference.

This comment is now incorrect.

@@ +1173,4 @@
>    Send__delete__(this);
> +
> +  // If mCompletingInterceptedRedirect is true, that means the IPDL connection
> +  // was teared down by a interception logic in CompleteRedirectSetup, and we

s/teared/torn/

@@ +1205,5 @@
> +  NS_IMETHOD Run() override {
> +    bool ret = mChannel->Redirect3Complete(this);
> +
> +    // If the method returns false, it means the IPDL connection is being
> +    // asyncly teared down and reopened, and OverrideWithSynthesizedResponse

s/teared/torn/

@@ +1221,5 @@
> +void HttpChannelChild::FinishInterceptedRedirect()
> +{
> +  nsresult rv;
> +  if (mLoadInfo && mLoadInfo->GetEnforceSecurity()) {
> +      MOZ_ASSERT(!mInterceptedRedirectContext, "aContext should be null!");

nit: indentation, and aContext doesn't exist now.

@@ +1236,5 @@
> +  }
> +
> +  if (mOverrideRunnable) {
> +    RefPtr<OverrideRunnable> override = static_cast<OverrideRunnable*>(mOverrideRunnable.get());
> +    override->mNewChannel->OverrideWithSynthesizedResponse(override->mHead, override->mInput, override->mListener);

Is mNewChannel different than |this|?

@@ +1523,5 @@
> +// channel, or false the interception logic kicked in and we need to asyncly
> +// call FinishInterceptedRedirect and CleanupRedirectingChannel.
> +// The argument is an optional Override runnable that we pass to redirectChannel.
> +bool
> +HttpChannelChild::Redirect3Complete(Runnable* aRunnable)

Let's use the explicit OverrideRunnable type here.

@@ +1528,5 @@
>  {
>    LOG(("HttpChannelChild::Redirect3Complete [this=%p]\n", this));
>    nsresult rv = NS_OK;
>  
> +  RefPtr<HttpChannelChild> chan = do_QueryObject(mRedirectChannelChild);

Elsewhere I think we use QI to nsIHttpChannelChild and then static_cast. Not sure if it's worth adding the extra IID to do this here.

@@ +1543,5 @@
> +  if (!chan || !chan->mCompletingInterceptedRedirect) {
> +    // The redirect channel either isn't a HttpChannelChild, or the interception
> +    // logic wasn't triggered, so we can clean it up right here.
> +    CleanupRedirectingChannel(rv);
> +    chan->mOverrideRunnable = nullptr;

This executes if chan is null.

::: netwerk/protocol/http/HttpChannelChild.h
@@ +262,5 @@
> +  nsCOMPtr<nsISupports> mInterceptedRedirectContext;
> +  // Needed to call CleanupRedirectingChannel in FinishInterceptedRedirect
> +  RefPtr<HttpChannelChild> mCreator;
> +  // Used to call OverrideWithSynthesizedResponse in FinishInterceptedRedirect
> +  RefPtr<Runnable> mOverrideRunnable;

Can we use the OverrideRunnable type here if we forward-declare it?

::: netwerk/protocol/http/HttpChannelParent.cpp
@@ +1678,5 @@
> +HttpChannelParent::RecvDeletingChannel()
> +{
> +  // We need to ensure that the parent channel will not be sending any more IPC
> +  // messages after this, as the child is going away.
> +  return DoSendDeleteSelf();

We previously had a protocol where the parent could initiate deletion, and the child completes it. We're now using the protocol of child initiates delete, parent confirms it, child performs deletion. This seems inefficient, but maybe it doesn't matter. Fixing it would require allowing either side to call Send__delete__ in response to DeletingChannel/DeleteSelf.

::: netwerk/protocol/http/PHttpChannel.ipdl
@@ +84,5 @@
>    // to remove any matching entry from the CORS preflight cache.
>    async RemoveCorsPreflightCacheEntry(URIParams uri,
>                                        PrincipalInfo requestingPrincipal);
>  
> +  // Bug 1294719#c6 We call this in order to tell the parent not to send any

I don't think the bug number here is necessary. Perhaps better described as "The parent must not send the child any more messages after receiving this message."
Attachment #8799255 - Flags: review?(josh)
Comment on attachment 8799255 [details] [diff] [review]
Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__()

Review of attachment 8799255 [details] [diff] [review]:
-----------------------------------------------------------------

The approach with a loop through the parent is generally ok.  However, I'd personally do it a bit different way.  Instead of introducing a too general "async DeletingChannel()" I would add a one-purpose "async FinishInterceptedRedirect()" on *both* parent and child. 

On child it would be used (sent) instead of your SendDeletingChannel() - on the same place.

RecvFinishInterceptedRedirect on parent would set mIPCOpen = false (-> no more messages from the parent) and send back FinishInterceptedRedirect to the child as confirmation.

RecvFinishInterceptedRedirect on child would then go on with what CompleteRedirectSetup when mShouldParentIntercept == true was doing (Send__delete__(this) + AsyncOpen()).

It's more clean and may avoid some tight race conditions.  The inline comments mostly are pointing at some of those races and led me to this conclusion.

There are also few questions in the comment below, please check on them too.

::: netwerk/protocol/http/HttpChannelChild.cpp
@@ +207,5 @@
>    if (mKeptAlive && mRefCnt == 1 && mIPCOpen) {
>      mKeptAlive = false;
>      // Send_delete calls NeckoChild::DeallocPHttpChannel, which will release
>      // again to refcount==0
> +    SendDeletingChannel();

so, here is one major problem with this patch.  see the comment above?  it no longer applies.  the mIPCOpen flag is dropped to false in the unpatched code immediately and also Release() is called again from within Send__delete__, so the outer call to Release() actually releases the channel.  

but with the patch, nothing like this happens.

true is that only referrer of the channel is at this point the ipc sub-channel.  but when a message comes from ipc (from the parent) that leads to exposition of this http child channel to someone that addrefs/releases it before ipc is actually closed, you will call SendDeletingChannel() again...

is it possible to close ipc only in one way?  like shutdown(WRITE) on tcp sockets.  both sides would have to do it and the original problem would go away.  but I'm not sure the chromium ipc code allows that at all.  what you are trying to implement here is something similar, just harder to do, because you do it "on application level" - networking terminology used again.

so, what you need is unidirectional close flag for send and receive.  after this point (after you send SendDeletingChannel()) you must not send from child to parent any more.  We generally ensure that by checking the mIPCOpen flag, but that is left set with your patch.

you can't just drop it here because there would be no way then to release the last "reference" from IPC when the channel is deleted, IIUC.

@@ +238,5 @@
>    NS_INTERFACE_MAP_ENTRY(nsIHttpChannelChild)
>    NS_INTERFACE_MAP_ENTRY_CONDITIONAL(nsIAssociatedContentSecurity, GetAssociatedContentSecurity())
>    NS_INTERFACE_MAP_ENTRY(nsIDivertableChannel)
> +  // we have no macro that covers this case.
> +  if (aIID.Equals(NS_GET_IID(HttpChannelChild)) ) {

really? doesn't NS_INTERFACE_MAP_ENTRY or NS_IMPL_QUERY_BODY work for you?

@@ +944,5 @@
>      SendDocumentChannelCleanup();
>    } else {
> +    // The parent process will respond by sending a DeleteSelf message and
> +    // making sure not to send any more messages after that.
> +    SendDeletingChannel();

why don't you leave the current logic here?

@@ +1128,5 @@
>    // We're already being called from IPDL, therefore already "async"
>    HandleAsyncAbort();
>  
>    if (mIPCOpen) {
> +    SendDeletingChannel();

same here

@@ +1173,5 @@
>    Send__delete__(this);
> +
> +  // If mCompletingInterceptedRedirect is true, that means the IPDL connection
> +  // was teared down by a interception logic in CompleteRedirectSetup, and we
> +  // need to call FinishInterceptedRedirect afterwards.

could it happen that (Recv)DeleteSelf() is being called from a different reason while the flag is still set?

@@ +1229,5 @@
> +  }
> +  mInterceptedRedirectListener = nullptr;
> +  mInterceptedRedirectContext = nullptr;
> +
> +  if (mCreator) {

this member should have a better name, this is too general

@@ +1236,5 @@
> +  }
> +
> +  if (mOverrideRunnable) {
> +    RefPtr<OverrideRunnable> override = static_cast<OverrideRunnable*>(mOverrideRunnable.get());
> +    override->mNewChannel->OverrideWithSynthesizedResponse(override->mHead, override->mInput, override->mListener);

please enclose this as a method of the OverrideRunnable class.  maybe call it simply "void OverrideWithSynthesizedResponse()"

@@ +1528,5 @@
>  {
>    LOG(("HttpChannelChild::Redirect3Complete [this=%p]\n", this));
>    nsresult rv = NS_OK;
>  
> +  RefPtr<HttpChannelChild> chan = do_QueryObject(mRedirectChannelChild);

s/chan/httpChannelChild/

@@ +1575,5 @@
>      mInterceptListener = nullptr;
>    }
>  }
>  
> +

nit: remove this change

@@ +1648,5 @@
> +    // FinishInterceptedRedirect() which AsyncOpen's the channel again.
> +    SendDeletingChannel();
> +
> +    // XXX valentin: The interception logic should be rewritten to avoid
> +    // doing AsyncOpen on the channel after the Send__delete__()

can you say why?  looks like this patch is trying to do it, isn't it?  or I don't understand this comment at all?

::: netwerk/protocol/http/HttpChannelChild.h
@@ +38,5 @@
>  
>  class InterceptedChannelContent;
>  class InterceptStreamListener;
> +
> +#define NS_HTTPCHANNELCHILD_IID                       \

should this be _OID (object id) rather?

OTOH, would a method on nsIHttpChannelChild do better for you?

::: netwerk/protocol/http/HttpChannelParent.cpp
@@ +1677,5 @@
> +bool
> +HttpChannelParent::RecvDeletingChannel()
> +{
> +  // We need to ensure that the parent channel will not be sending any more IPC
> +  // messages after this, as the child is going away.

and how do we do it?  does DoSendDeleteSelf() do it?  if so, how?
Attachment #8799255 - Flags: review?(honzab.moz) → review-
* * *
[mq]: top

MozReview-Commit-ID: 4lb0gs32tOq
Attachment #8801579 - Flags: review?(josh)
Attachment #8801579 - Flags: review?(honzab.moz)
Attachment #8799255 - Attachment is obsolete: true
Crash Signature: ,name=???) Route error: message sent to unknown actor ID] → ,name=???) Route error: message sent to unknown actor ID] [@ IPCError-content | (msgtype=0x8E0017,name=???) Route error: message sent to unknown actor ID]
Comment on attachment 8801579 [details] [diff] [review]
Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__()

For some reason I hadn't noticed Honza had reviewed the previous patch.
I will address those comments as well in the next version.
Attachment #8801579 - Flags: review?(josh)
Attachment #8801579 - Flags: review?(honzab.moz)
I only now noticed that we never check mIPCClosed before calling SendRedirect1Begin which could also cause this crash (even if there's a race or not). I would like to land this simple patch first, and see if it does anything to the crash numbers. As BillM said in comment 6, this is possibly caused by a race between child and parent, but I'd like to make sure before we uplift a pretty complicated patch.
Attachment #8802568 - Flags: review?(honzab.moz)
* * *
[mq]: top

MozReview-Commit-ID: 4lb0gs32tOq
Attachment #8802569 - Flags: review?(josh)
Attachment #8802569 - Flags: review?(honzab.moz)
Attachment #8801579 - Attachment is obsolete: true
(In reply to Honza Bambas (:mayhemer) from comment #36)
> Instead of introducing a too general
> "async DeletingChannel()" I would add a one-purpose "async
> FinishInterceptedRedirect()" on *both* parent and child. 

I think we probably need a DeletingChannel as well. Otherwise, the child is doing releasing references, and the last one calls Send__delete__(), but the parent at the same time may Send another message which leads to this crash.

> true is that only referrer of the channel is at this point the ipc
> sub-channel.  but when a message comes from ipc (from the parent) that leads
> to exposition of this http child channel to someone that addrefs/releases it
> before ipc is actually closed, you will call SendDeletingChannel() again...
> 

This wasn't a problem because in RecvDeleteSelf we would indeed AddRef, but call Send__delete__() immediately. This sets mIPCOpen to false, so we don't end up calling SendDeletingChannel again. Actually, if we now have a RecvFinishInterceptingRedirect, we 

> 
> could it happen that (Recv)DeleteSelf() is being called from a different
> reason while the flag is still set?

I think it's possible. That's why I agree with your suggestion for a new message.
 
> > +    // XXX valentin: The interception logic should be rewritten to avoid
> > +    // doing AsyncOpen on the channel after the Send__delete__()
> 
> can you say why?  looks like this patch is trying to do it, isn't it?  or I
> don't understand this comment at all?

We still AsyncOpen the channel in FinishInterceptedRedirect, after calling Send__delete__(). Josh said he is addressing this in bug 1231222.
Comment on attachment 8802568 [details] [diff] [review]
Make sure to check mIPCClosed before calling SendRedirect1Begin

Review of attachment 8802568 [details] [diff] [review]:
-----------------------------------------------------------------

Nice!  Let's try.  Thanks.
Attachment #8802568 - Flags: review?(honzab.moz) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/73ad49f175814ebeb16b8fb7b09b3217540332a9
Bug 1294719 - Make sure to check mIPCClosed before calling SendRedirect1Begin r=honzab
Crash volume for signature 'IPCError-content | (msgtype=0x8E0017,name=???) Route error: message sent to unknown actor ID':
 - nightly (version 52): 184 crashes from 2016-09-19.
 - aurora  (version 51): 563 crashes from 2016-09-19.
 - beta    (version 50): 1346 crashes from 2016-09-20.
 - release (version 49): 0 crashes from 2016-09-05.
 - esr     (version 45): 0 crashes from 2016-07-25.

Crash volume on the last weeks (Week N is from 10-17 to 10-23):
            W. N-1  W. N-2  W. N-3  W. N-4
 - nightly      22      16      79      66
 - aurora      172     137     184      32
 - beta        322     366     447     136
 - release       0       0       0       0
 - esr           0       0       0       0

Affected platforms: Windows, Mac OS X, Linux

Crash rank on the last 7 days:
             Browser Content     Plugin
 - nightly           #52
 - aurora            #6
 - beta              #5
 - release
 - esr
https://hg.mozilla.org/mozilla-central/rev/73ad49f17581
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
(In reply to Honza Bambas (:mayhemer) from comment #43)
> Comment on attachment 8802568 [details] [diff] [review]
> Make sure to check mIPCClosed before calling SendRedirect1Begin
> 
> Review of attachment 8802568 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Nice!  Let's try.  Thanks.

Didn't fix it :(
https://crash-stats.mozilla.com/report/index/a6ea2d48-fc79-4747-a53a-fe3322161021
This may be a bit late, but still worth:

(In reply to Valentin Gosu [:valentin] from comment #42)
> (In reply to Honza Bambas (:mayhemer) from comment #36)
> > Instead of introducing a too general
> > "async DeletingChannel()" I would add a one-purpose "async
> > FinishInterceptedRedirect()" on *both* parent and child. 
> 
> I think we probably need a DeletingChannel as well. Otherwise, the child is
> doing releasing references, and the last one calls Send__delete__(), but the
> parent at the same time may Send another message which leads to this crash.

Comment 6 is a good summary of what's happening here, yes, we may need it.  I also start thinking you should have FinishInterceptedRedirect message and DeletingChannel message as well.

Although, RecvFinishInterceptedRedirect will *not* close the IPC channel.  It will just do any necessary parent/child looping related to interception and redirects, nothing more.  To close ipc you would use explicitly DeletingChannel call.  So that you would have in your code something like (schematically):

  SendFinishInterceptedRedirect();
  SendDeletingChannel();

or a like...

DeletingChannel will be available on both sides (parent and child) and stand for sending a FIN packet in a TCP connection - by means "I'm done sending IPC to you, if you are as well, just delete the channel".  A bit more work, but when it proves itself, it could become a new standard of closing IPC making the fragile mIPCClosed flag obsolete.

More in detail, have two flags, mLocalIPCClosed and mRemoteIPCClosed (or better names).  mLocalIPCClosed means: I want my side of IPC close and will never send more messages (checked before each Send*()).  mRemoteIPCClosed means you got RecvDeletingChannel().  When *both* the flags are set, just do Send__delete__().  You need a method that examines these flags and call either DeletingChannel() or Send__delete__() based on mRemoteIPCClosed.  DeletingChannel and Send__Delete__ will only ever be called from that method!  On places you will want to close the IPC you will call this new IPC close wrapping method.

That's my few architectural cents.

> 
> > true is that only referrer of the channel is at this point the ipc
> > sub-channel.  but when a message comes from ipc (from the parent) that leads
> > to exposition of this http child channel to someone that addrefs/releases it
> > before ipc is actually closed, you will call SendDeletingChannel() again...
> > 
> 
> This wasn't a problem because in RecvDeleteSelf we would indeed AddRef, but
> call Send__delete__() immediately. This sets mIPCOpen to false, so we don't
> end up calling SendDeletingChannel again. Actually, if we now have a
> RecvFinishInterceptingRedirect, we 

Didn't finish the comment? :)  I'm not sure I follow anyway.  If you are trying to persuade me about a simpler solution, please do.  Just in a way I understand it :)

> We still AsyncOpen the channel in FinishInterceptedRedirect, after calling
> Send__delete__(). Josh said he is addressing this in bug 1231222.

Aha.  Got it.
I think you are completely right. However, this patch will probably have to be uplifted, and I really don't want to make too many changes. I'm aiming for the smallest amount of code change that doesn't change behaviour too much.
Do you think this patch is OK? I'd love to push this to m-c, to make sure it really fixes our problem.

(In reply to Honza Bambas (:mayhemer) from comment #48)
> Although, RecvFinishInterceptedRedirect will *not* close the IPC channel.

It does close it. Send__delete__() definitely closes. We reopen it by calling AsyncOpen.

> > This wasn't a problem because in RecvDeleteSelf we would indeed AddRef, but
> > call Send__delete__() immediately. This sets mIPCOpen to false, so we don't
> > end up calling SendDeletingChannel again. Actually, if we now have a
> > RecvFinishInterceptingRedirect, we 

I think I meant to say that RecvDeleteSelf doesn't actually need to keep a ref to the channel, since it doesn't call any other methods anymore.
RecvFinishInterceptingRedirect does need to keep a ref, but that's OK. The ref count will be 1 entering the call. When we addref it will be 2. Then we call Send__delete__() which leads to mIPCOpen=false, and the refcount goes to 1. Then we call AsyncOpen on the object, which calls AddIPDLReference and reopens the IPDL channel.
Flags: needinfo?(honzab.moz)
Crash Signature: ,name=???) Route error: message sent to unknown actor ID] [@ IPCError-content | (msgtype=0x8E0017,name=???) Route error: message sent to unknown actor ID] → ,name=???) Route error: message sent to unknown actor ID] [@ IPCError-content | (msgtype=0x8E0017,name=???) Route error: message sent to unknown actor ID] [@ IPCError-content | PHttpChannel::Msg_Redirect1Begin Route error: message sent to unknown actor …
Comment on attachment 8802569 [details] [diff] [review]
Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__()

Review of attachment 8802569 [details] [diff] [review]:
-----------------------------------------------------------------

locally tested with dom/workers/test/serviceworkers/test_openWindow.html.  code exercised, no crashes, no leaks.  but if this fixes the original reported crash, that's hard to say.

there are just few cosmetic changes to do, then land please.

::: netwerk/protocol/http/HttpChannelChild.cpp
@@ +1155,5 @@
>  
> +bool
> +HttpChannelChild::RecvFinishInterceptedRedirect()
> +{
> +  // Hold a ref to this to keep it from being deleted by Send__delete__()

here it makes sense to keep the ref...

@@ +1170,5 @@
>  void
>  HttpChannelChild::DeleteSelf()
>  {
> +  // Hold a ref to this to keep it from being deleted by Send__delete__()
> +  RefPtr<HttpChannelChild> self(this);

...but here it doesn't.  it will still delete before this method returns.  I removed this locally and it didn't crash.  if there is no reason (I don't see it, nor comment suggest any) please remove it.

@@ +1175,4 @@
>    Send__delete__(this);
>  }
>  
> +class OverrideRunnable : public Runnable {

should be in an anon namespace, but probably better would be declare it as a nested class of HttpChannelChild

@@ +1195,5 @@
> +  , mHead(aHead)
> +  {
> +  }
> +
> +  NS_IMETHOD Run() override {

{ on a new line

@@ +1234,5 @@
> +  }
> +
> +  if (mOverrideRunnable) {
> +    RefPtr<OverrideRunnable> override =
> +      static_cast<OverrideRunnable*>(mOverrideRunnable.get());

if OverrideRunnable is nested to httpchannelchild, you may keep a concrete type refptr.

@@ +1529,5 @@
>    LOG(("HttpChannelChild::Redirect3Complete [this=%p]\n", this));
>    nsresult rv = NS_OK;
>  
> +  nsCOMPtr<nsIHttpChannelChild> chan = do_QueryInterface(mRedirectChannelChild);
> +  RefPtr<HttpChannelChild> httpChannelChild = static_cast<HttpChannelChild*>(chan.get());

grrr... is there no other way than static cast?
Attachment #8802569 - Flags: review?(honzab.moz) → review+
Flags: needinfo?(honzab.moz)
https://hg.mozilla.org/integration/mozilla-inbound/rev/de41e684fb55d7e64a3eecf8ee0e4279f37e0724
Bug 1294719 - Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__() r=honzab
As a matter of reproduction, I quite often experience this on Linux x86_64 loading guardian.co.uk.
Is my review here still required since this merged?
(In reply to Josh Matthews [:jdm] from comment #56)
> Is my review here still required since this merged?
Not really. :) Your initial review was very helpful. But let me know if you saw something that seems wrong in the patch.
Comment on attachment 8802569 [details] [diff] [review]
Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__()

Approval Request Comment
[Feature/regressing bug #]: Preexisting bug in e10s implementation.
[User impact if declined]: A race condition may cause crashes when the parent process sends a message after the child has tried to close the IPC channel. Most commonly happens with redirects.
[Describe test coverage new/current, TreeHerder]: We have unit tests checking that redirects and service workers still work properly.
[Risks and why]: This changes the way we shut down the IPC channel. We try to keep behaviour just the same. The risk is low, but not zero.
[String/UUID change made/needed]: None.
Attachment #8802569 - Flags: review?(josh)
Attachment #8802569 - Flags: approval-mozilla-beta?
Attachment #8802569 - Flags: approval-mozilla-aurora?
Comment on attachment 8802569 [details] [diff] [review]
Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__()

Let's stabilize this change on Aurora51 and then decide whether it's sufficiently low risk to include in a 50 RC build or not.
Attachment #8802569 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8802569 [details] [diff] [review]
Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__()

The data from Nightly seems promising on this one. I am taking a big leap of faith on this one, taking it in 50 RC1. If things look worse, we will plan a backout. Aurora51+, Beta50+
Attachment #8802569 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
This is hitting conflicts applying to beta.
Flags: needinfo?(valentin.gosu)
Rebased on beta.
Flags: needinfo?(valentin.gosu) → needinfo?(wkocher)
Hi Jim, Erin, going back on the history of this bug, we decided to disable e10s on RU locale to work around this problem. Is e10s going to be disabled for RU users in 50 as well? In that case, I would prefer not to uplift this fix to Beta50. The crash rates on 50 are finally looking good and I'd hate to disrupt that.
Flags: needinfo?(jmathies)
Flags: needinfo?(elancaster)
Comment on attachment 8802569 [details] [diff] [review]
Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__()

Hi Wes, Tomcat, please hold off on uplifting this to Beta50. I am a bit unclear on the severity of the issue and feel like it might be too late to take this in 50.
Attachment #8802569 - Flags: approval-mozilla-beta+ → approval-mozilla-beta?
(In reply to Ritu Kothari (:ritu) from comment #65)
> Hi Jim, Erin, going back on the history of this bug, we decided to disable
> e10s on RU locale to work around this problem. Is e10s going to be disabled
> for RU users in 50 as well? In that case, I would prefer not to uplift this
> fix to Beta50. The crash rates on 50 are finally looking good and I'd hate
> to disrupt that.

We want e10s on for russian users before 51 lands in release so we can either do that through an uplift now (necko engineers should comment on how comfortable they are with a late 50 uplift) or through an eventual 50 point release.
Flags: needinfo?(jmathies)
Flags: needinfo?(elancaster)
Of course I'd be more comfortable with doing a point release, if it has a few weeks to bake on beta, but a late uplift shouldn't be too risky. Also, we should probably coordinate it with re-enabling e10s for Russian users.
Based on nightly crash reports, we can safely say this bug is fixed.
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
Thanks Jim and Valentin. Let's not uplift this in 50 but rather in the planned 50.1.0 release. By then this fix will have baked on Aurora51/Beta51 for a few weeks.
Attachment #8802569 - Flags: approval-mozilla-beta?
Comment on attachment 8802569 [details] [diff] [review]
Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__()

Flagging for potential inclusion into the december point release.

Approval Request Comment
[Feature/regressing bug #]:
regression caused by e10s / necko code
[User impact if declined]:
e10s is disabled for russian users due to this, we want to get those users turned back on.
[Describe test coverage new/current, TreeHerder]:
patch landed on nightly/aurora.
[Risks and why]: 
medium, invasive necko patch, but pretty well tested in current releases that have the fix.
[String/UUID change made/needed]:
none
Attachment #8802569 - Flags: approval-mozilla-release?
Comment on attachment 8802569 [details] [diff] [review]
Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__()

This was a top crasher in 49,50. It has stabilized for a few weeks on pre-release channels, let's include it in 50.1.0
Attachment #8802569 - Flags: approval-mozilla-release? → approval-mozilla-release+
Hi Valentin, Jim, this fix hasn't worked well on the release channel it seems. I see an increase in crash rates on 50.1.0 release and Philipp mentioned that this bug might be a potential reason. Can you please investigate? Hoping that by the time Fx51 will go-live mid-Jan we will have an updated fix.
Flags: needinfo?(valentin.gosu)
Flags: needinfo?(jmathies)
sorry for being too ambiguous - the signature is spiking up http://bit.ly/2h2NYFU most probably due to some third-party website change (ad networks?), but only on 50.0/50.0.1/50.0.2 - so the fix we included in 50.1 seems to do it's job!
Flags: needinfo?(valentin.gosu)
Flags: needinfo?(jmathies)
Regressions: 1651661
See Also: → 1651661
You need to log in before you can comment on or make changes to this bug.