Bugzilla

Comment 8

•

8 years ago

(100.0% in signature vs 24.64% overall) dom_ipc_enabled = 1
(80.31% in signature vs 09.48% overall) useragent_locale = ru
(98.27% in signature vs 41.32% overall) reason = EXCEPTION_BREAKPOINT
(71.17% in signature vs 44.98% overall) platform_pretty_version = Windows 7
(62.63% in signature vs 38.21% overall) platform_version = 6.1.7601 Service Pack 1

The most occuring domain is ok.ru, but most crashes have an empty URL field.

Erin Lancaster [:elan]

Comment 9

•

8 years ago

Relman has found this crash is spiking over the last few days - (8.5% of all content crashes on 48.0.2). Can this ticket be escalated to the top of your list?

Status: NEW → ASSIGNED

Flags: needinfo?(valentin.gosu)

Assignee

Comment 10

•

8 years ago

It is currently my top priority and I'm working on a fix.

Flags: needinfo?(valentin.gosu)

Comment 11

•

8 years ago

Thanks Valentin. A fix that we would ship in a dot release sounds like a good idea.  

Maybe in the meantime we can disable e10s for russian locales.  I don't like to think of people crashing while using a popular mail client, if we can prevent that, ship a fix fairly soon, we could then turn e10s back on right after a dot release. Brad, Felipe, what do you think? Can we do that with an update to the system addon?

Flags: needinfo?(felipc)

Flags: needinfo?(blassey.bugs)

Comment 12

•

8 years ago

n-i to jimm and billm since brad is traveling.

Flags: needinfo?(wmccloskey)

Flags: needinfo?(jmathies)

Flags: needinfo?(blassey.bugs)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 13

•

8 years ago

Answering the question of whether we could: yeah

There was some conversations in #e10s about whether it's worth doing this or just wait for the problem to be fixed on the dot release.
I'll be paying attention on what's decided.

Flags: needinfo?(felipc)

Comment 14

•

8 years ago

Felipe knows much more than me about how to do this. Fixing this could take a couple more days at least, so if we can disable for Russian locales more quickly than that, it might be worth it.

Flags: needinfo?(wmccloskey)

Updated

•

8 years ago

Blocks: 1304164

Assignee

Comment 15

•

8 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3eee34b0cf65

Reporter

Comment 16

•

8 years ago

(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #11)
> Thanks Valentin. A fix that we would ship in a dot release sounds like a
> good idea.  
> 
> Maybe in the meantime we can disable e10s for russian locales.  I don't like
> to think of people crashing while using a popular mail client, if we can
> prevent that, ship a fix fairly soon, we could then turn e10s back on right
> after a dot release. Brad, Felipe, what do you think? Can we do that with an
> update to the system addon?

Depends on the timing - If this bug is going to remain out there causing issues for a week or longer I'd say go ahead and disable e10s for the locale through the system add-on. If we plan to get a fix in within the next few days, we can wait.

Flags: needinfo?(jmathies)

Reporter

Comment 17

•

8 years ago

Probably safe to assume the fix will need some bake time too. Valentin, what's your thinking? I'm leaning toward disabling for .ru temporarily so you have the time to get this reviewed, landed, and uplifted.

Flags: needinfo?(valentin.gosu)

Assignee

Comment 18

•

8 years ago

I think it's fair to disable it for the locale. I don't think it would be wise to uplift the patch without letting it sit for a few days in m-c.

Flags: needinfo?(valentin.gosu)

Assignee

Comment 19

•

8 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=de233b4cfb12

Reporter

Updated

•

8 years ago

tracking-e10s: ? → +

Updated

•

8 years ago

Blocks: 1304874

Comment 20

•

8 years ago

Valentin, when you get this landed you should take the patch from bug 1304874 and land it together.

Comment 21

•

8 years ago

Has anyone been able to reproduce the issue? If so, can you put some STR in to the bug?

Andrei, maybe your team can dig into trying to reproduce this (or maybe you already have).  Thanks!

Flags: needinfo?(andrei.vaida)

Comment 22

•

8 years ago

Tracking for 49 and 50 since we know they are affected.

status-firefox50: --- → affected

tracking-firefox49: ? → +

tracking-firefox50: --- → +

Comment 23

•

8 years ago

Have we seen a reduction in this crash with the deployment of bug 1304164?

Flags: needinfo?(mcastelluccio)

Comment 24

•

8 years ago

We've definitely seen a reduction (from ~1400 on Sep 22 to ~700 on Sep 24). It's been
steady since Sep 24 (between ~650 to ~850).

https://crash-stats.mozilla.com/signature/?product=Firefox&signature=IPCError-content%20%7C%20%28msgtype%3D0x820017%2Cname%3D%3F%3F%3F%29%20Route%20error%3A%20message%20sent%20to%20unknown%20actor%20ID&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_sort=-date&page=1#graphs

The remaining crashes are still correlated with the Russian locale (less than before,
probably because some of them now have version 1.3 of the addon) and with e10s enabled:
https://crash-stats.mozilla.com/search/?signature=%3DIPCError-content%20%7C%20%28msgtype%3D0x820017%2Cname%3D%3F%3F%3F%29%20Route%20error%3A%20message%20sent%20to%20unknown%20actor%20ID&date=%3E%3D2016-09-24&product=Firefox&version=49.0.1&version=49.0&_sort=-date&_facets=signature&_facets=useragent_locale&_facets=addons&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-useragent_locale.

Many users still have version 1.2 of e10srollout@mozilla.org, some have version 1.3:
  3206 crash reports have been analyzed and the following versions have been found:
   - e10srollout@mozilla.org
     - 1.2: 2733
     - 1.3: 449

Tomorrow I will check how many users are experiencing this crash with the Russian locale
and the addon at version 1.3 (hopefully ~0).

I would expect the number of crash reports to reduce as more people are updated to 1.3, I'm
not sure why it hasn't happened yet between Sep 24 and now.

Comment 25

•

8 years ago

(In reply to Marco Castelluccio [:marco] from comment #24)
> Tomorrow I will check how many users are experiencing this crash with the
> Russian locale
> and the addon at version 1.3 (hopefully ~0).

Yes, just one report with locale=RU has version 1.3 of the addon (perhaps
they forcibly enabled e10s), all the other reports with locale=RU (2357)
have version 1.2 of the addon.

Flags: needinfo?(mcastelluccio)

Reporter

Comment 26

•

8 years ago

Valentin, curious how the fix is coming along?

Flags: needinfo?(valentin.gosu)

Assignee

Comment 27

•

8 years ago

I've hit a roadblock implementing the fix because of a quirk in how we do service worker redirect intercepting. This is causing regressions in a couple of tests.
:jdm has helped me debug the problem, but I'm still working on a fix.

Flags: needinfo?(valentin.gosu)

Reporter

Updated

•

8 years ago

Crash Signature: [@ IPCError-content | (msgtype=0x820017,name=???) Route error: message sent to unknown actor ID] → [@ IPCError-content | (msgtype=0x820017,name=???) Route error: message sent to unknown actor ID] [@ IPCError-content | (msgtype=0x900017,name=???) Route error: message sent to unknown actor ID]

Assignee

Comment 29

•

8 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=5b15045dedac

Assignee

Comment 30

•

8 years ago

Attached patch Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__() (obsolete) — Details — Splinter Review

Attachment #8799255 - Flags: review?(josh)

Attachment #8799255 - Flags: review?(honzab.moz)

Assignee

Comment 31

•

8 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0d534b490c99

Alexandru Simonca [:asimonca], Release Desktop QA

Assignee

Comment 32

•

8 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b77692f41e33

Comment 33

•

8 years ago

I have tried reproducing this issue on Windows 10 x64 and Windows 7 x64 using:
- Beta 49.0b1
- Beta 49.0b6
- RC 48
- RC 49.0.2
all RU locale builds, e10s on and off but couldn't reproduce the crash. 
I have accessed Mail.ru and OK.ru and other russian sites but all went smoothly. Please let me know when there is more information on the issue and I can give it another try.

Flags: needinfo?(andrei.vaida)

Kan-Ru Chen [:kanru] (UTC+9)

Updated

•

8 years ago

Crash Signature: [@ IPCError-content | (msgtype=0x820017,name=???) Route error: message sent to unknown actor ID] [@ IPCError-content | (msgtype=0x900017,name=???) Route error: message sent to unknown actor ID] → [@ IPCError-content | (msgtype=0x820017,name=???) Route error: message sent to unknown actor ID] [@ IPCError-content | (msgtype=0x900017,name=???) Route error: message sent to unknown actor ID] [@ IPCError-content | (msgtype=0x8A0017,name=???) Route err…

Josh Matthews [:jdm]

Comment 35

•

8 years ago

Comment on attachment 8799255 [details] [diff] [review]
Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__()

Review of attachment 8799255 [details] [diff] [review]:
-----------------------------------------------------------------

I don't love it, but I don't have better ideas. I'm pretty sure all of this complexity will go away with bug 1231222. I'd like to see another revision with my comments addressed.

::: netwerk/protocol/http/HttpChannelChild.cpp
@@ +212,1 @@
>      return 0;

This isn't right any more - DeallocPHttpChannel isn't called any more, so the refcount is still 1 at this point.

@@ +240,5 @@
>    NS_INTERFACE_MAP_ENTRY(nsIDivertableChannel)
> +  // we have no macro that covers this case.
> +  if (aIID.Equals(NS_GET_IID(HttpChannelChild)) ) {
> +      AddRef();
> +      *aInstancePtr = this;

NS_ADDREF(*aInstancePtr = this);

@@ +1133,5 @@
>    }
>    // WARNING:  DO NOT RELY ON |THIS| EXISTING ANY MORE! 
>    // 
>    // NeckoChild::DeallocPHttpChannelChild() may have been called, which deletes 
>    // |this| if IPDL holds the last reference.

This comment is now incorrect.

@@ +1173,4 @@
>    Send__delete__(this);
> +
> +  // If mCompletingInterceptedRedirect is true, that means the IPDL connection
> +  // was teared down by a interception logic in CompleteRedirectSetup, and we

s/teared/torn/

@@ +1205,5 @@
> +  NS_IMETHOD Run() override {
> +    bool ret = mChannel->Redirect3Complete(this);
> +
> +    // If the method returns false, it means the IPDL connection is being
> +    // asyncly teared down and reopened, and OverrideWithSynthesizedResponse

s/teared/torn/

@@ +1221,5 @@
> +void HttpChannelChild::FinishInterceptedRedirect()
> +{
> +  nsresult rv;
> +  if (mLoadInfo && mLoadInfo->GetEnforceSecurity()) {
> +      MOZ_ASSERT(!mInterceptedRedirectContext, "aContext should be null!");

nit: indentation, and aContext doesn't exist now.

@@ +1236,5 @@
> +  }
> +
> +  if (mOverrideRunnable) {
> +    RefPtr<OverrideRunnable> override = static_cast<OverrideRunnable*>(mOverrideRunnable.get());
> +    override->mNewChannel->OverrideWithSynthesizedResponse(override->mHead, override->mInput, override->mListener);

Is mNewChannel different than |this|?

@@ +1523,5 @@
> +// channel, or false the interception logic kicked in and we need to asyncly
> +// call FinishInterceptedRedirect and CleanupRedirectingChannel.
> +// The argument is an optional Override runnable that we pass to redirectChannel.
> +bool
> +HttpChannelChild::Redirect3Complete(Runnable* aRunnable)

Let's use the explicit OverrideRunnable type here.

@@ +1528,5 @@
>  {
>    LOG(("HttpChannelChild::Redirect3Complete [this=%p]\n", this));
>    nsresult rv = NS_OK;
>  
> +  RefPtr<HttpChannelChild> chan = do_QueryObject(mRedirectChannelChild);

Elsewhere I think we use QI to nsIHttpChannelChild and then static_cast. Not sure if it's worth adding the extra IID to do this here.

@@ +1543,5 @@
> +  if (!chan || !chan->mCompletingInterceptedRedirect) {
> +    // The redirect channel either isn't a HttpChannelChild, or the interception
> +    // logic wasn't triggered, so we can clean it up right here.
> +    CleanupRedirectingChannel(rv);
> +    chan->mOverrideRunnable = nullptr;

This executes if chan is null.

::: netwerk/protocol/http/HttpChannelChild.h
@@ +262,5 @@
> +  nsCOMPtr<nsISupports> mInterceptedRedirectContext;
> +  // Needed to call CleanupRedirectingChannel in FinishInterceptedRedirect
> +  RefPtr<HttpChannelChild> mCreator;
> +  // Used to call OverrideWithSynthesizedResponse in FinishInterceptedRedirect
> +  RefPtr<Runnable> mOverrideRunnable;

Can we use the OverrideRunnable type here if we forward-declare it?

::: netwerk/protocol/http/HttpChannelParent.cpp
@@ +1678,5 @@
> +HttpChannelParent::RecvDeletingChannel()
> +{
> +  // We need to ensure that the parent channel will not be sending any more IPC
> +  // messages after this, as the child is going away.
> +  return DoSendDeleteSelf();

We previously had a protocol where the parent could initiate deletion, and the child completes it. We're now using the protocol of child initiates delete, parent confirms it, child performs deletion. This seems inefficient, but maybe it doesn't matter. Fixing it would require allowing either side to call Send__delete__ in response to DeletingChannel/DeleteSelf.

::: netwerk/protocol/http/PHttpChannel.ipdl
@@ +84,5 @@
>    // to remove any matching entry from the CORS preflight cache.
>    async RemoveCorsPreflightCacheEntry(URIParams uri,
>                                        PrincipalInfo requestingPrincipal);
>  
> +  // Bug 1294719#c6 We call this in order to tell the parent not to send any

I don't think the bug number here is necessary. Perhaps better described as "The parent must not send the child any more messages after receiving this message."

Attachment #8799255 - Flags: review?(josh)

Comment 36

•

8 years ago

Comment on attachment 8799255 [details] [diff] [review]
Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__()

Review of attachment 8799255 [details] [diff] [review]:
-----------------------------------------------------------------

The approach with a loop through the parent is generally ok.  However, I'd personally do it a bit different way.  Instead of introducing a too general "async DeletingChannel()" I would add a one-purpose "async FinishInterceptedRedirect()" on *both* parent and child. 

On child it would be used (sent) instead of your SendDeletingChannel() - on the same place.

RecvFinishInterceptedRedirect on parent would set mIPCOpen = false (-> no more messages from the parent) and send back FinishInterceptedRedirect to the child as confirmation.

RecvFinishInterceptedRedirect on child would then go on with what CompleteRedirectSetup when mShouldParentIntercept == true was doing (Send__delete__(this) + AsyncOpen()).

It's more clean and may avoid some tight race conditions.  The inline comments mostly are pointing at some of those races and led me to this conclusion.

There are also few questions in the comment below, please check on them too.

::: netwerk/protocol/http/HttpChannelChild.cpp
@@ +207,5 @@
>    if (mKeptAlive && mRefCnt == 1 && mIPCOpen) {
>      mKeptAlive = false;
>      // Send_delete calls NeckoChild::DeallocPHttpChannel, which will release
>      // again to refcount==0
> +    SendDeletingChannel();

so, here is one major problem with this patch.  see the comment above?  it no longer applies.  the mIPCOpen flag is dropped to false in the unpatched code immediately and also Release() is called again from within Send__delete__, so the outer call to Release() actually releases the channel.  

but with the patch, nothing like this happens.

true is that only referrer of the channel is at this point the ipc sub-channel.  but when a message comes from ipc (from the parent) that leads to exposition of this http child channel to someone that addrefs/releases it before ipc is actually closed, you will call SendDeletingChannel() again...

is it possible to close ipc only in one way?  like shutdown(WRITE) on tcp sockets.  both sides would have to do it and the original problem would go away.  but I'm not sure the chromium ipc code allows that at all.  what you are trying to implement here is something similar, just harder to do, because you do it "on application level" - networking terminology used again.

so, what you need is unidirectional close flag for send and receive.  after this point (after you send SendDeletingChannel()) you must not send from child to parent any more.  We generally ensure that by checking the mIPCOpen flag, but that is left set with your patch.

you can't just drop it here because there would be no way then to release the last "reference" from IPC when the channel is deleted, IIUC.

@@ +238,5 @@
>    NS_INTERFACE_MAP_ENTRY(nsIHttpChannelChild)
>    NS_INTERFACE_MAP_ENTRY_CONDITIONAL(nsIAssociatedContentSecurity, GetAssociatedContentSecurity())
>    NS_INTERFACE_MAP_ENTRY(nsIDivertableChannel)
> +  // we have no macro that covers this case.
> +  if (aIID.Equals(NS_GET_IID(HttpChannelChild)) ) {

really? doesn't NS_INTERFACE_MAP_ENTRY or NS_IMPL_QUERY_BODY work for you?

@@ +944,5 @@
>      SendDocumentChannelCleanup();
>    } else {
> +    // The parent process will respond by sending a DeleteSelf message and
> +    // making sure not to send any more messages after that.
> +    SendDeletingChannel();

why don't you leave the current logic here?

@@ +1128,5 @@
>    // We're already being called from IPDL, therefore already "async"
>    HandleAsyncAbort();
>  
>    if (mIPCOpen) {
> +    SendDeletingChannel();

same here

@@ +1173,5 @@
>    Send__delete__(this);
> +
> +  // If mCompletingInterceptedRedirect is true, that means the IPDL connection
> +  // was teared down by a interception logic in CompleteRedirectSetup, and we
> +  // need to call FinishInterceptedRedirect afterwards.

could it happen that (Recv)DeleteSelf() is being called from a different reason while the flag is still set?

@@ +1229,5 @@
> +  }
> +  mInterceptedRedirectListener = nullptr;
> +  mInterceptedRedirectContext = nullptr;
> +
> +  if (mCreator) {

this member should have a better name, this is too general

@@ +1236,5 @@
> +  }
> +
> +  if (mOverrideRunnable) {
> +    RefPtr<OverrideRunnable> override = static_cast<OverrideRunnable*>(mOverrideRunnable.get());
> +    override->mNewChannel->OverrideWithSynthesizedResponse(override->mHead, override->mInput, override->mListener);

please enclose this as a method of the OverrideRunnable class.  maybe call it simply "void OverrideWithSynthesizedResponse()"

@@ +1528,5 @@
>  {
>    LOG(("HttpChannelChild::Redirect3Complete [this=%p]\n", this));
>    nsresult rv = NS_OK;
>  
> +  RefPtr<HttpChannelChild> chan = do_QueryObject(mRedirectChannelChild);

s/chan/httpChannelChild/

@@ +1575,5 @@
>      mInterceptListener = nullptr;
>    }
>  }
>  
> +

nit: remove this change

@@ +1648,5 @@
> +    // FinishInterceptedRedirect() which AsyncOpen's the channel again.
> +    SendDeletingChannel();
> +
> +    // XXX valentin: The interception logic should be rewritten to avoid
> +    // doing AsyncOpen on the channel after the Send__delete__()

can you say why?  looks like this patch is trying to do it, isn't it?  or I don't understand this comment at all?

::: netwerk/protocol/http/HttpChannelChild.h
@@ +38,5 @@
>  
>  class InterceptedChannelContent;
>  class InterceptStreamListener;
> +
> +#define NS_HTTPCHANNELCHILD_IID                       \

should this be _OID (object id) rather?

OTOH, would a method on nsIHttpChannelChild do better for you?

::: netwerk/protocol/http/HttpChannelParent.cpp
@@ +1677,5 @@
> +bool
> +HttpChannelParent::RecvDeletingChannel()
> +{
> +  // We need to ensure that the parent channel will not be sending any more IPC
> +  // messages after this, as the child is going away.

and how do we do it?  does DoSendDeleteSelf() do it?  if so, how?

Attachment #8799255 - Flags: review?(honzab.moz) → review-

Assignee

Comment 37

•

8 years ago

Attached patch Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__() (obsolete) — Details — Splinter Review

* * *
[mq]: top

MozReview-Commit-ID: 4lb0gs32tOq

Attachment #8801579 - Flags: review?(josh)

Attachment #8801579 - Flags: review?(honzab.moz)

Assignee

Updated

•

8 years ago

Attachment #8799255 - Attachment is obsolete: true

Assignee

Updated

•

8 years ago

Crash Signature: ,name=???) Route error: message sent to unknown actor ID] → ,name=???) Route error: message sent to unknown actor ID] [@ IPCError-content | (msgtype=0x8E0017,name=???) Route error: message sent to unknown actor ID]

Assignee

Comment 39

•

8 years ago

Comment on attachment 8801579 [details] [diff] [review]
Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__()

For some reason I hadn't noticed Honza had reviewed the previous patch.
I will address those comments as well in the next version.

Attachment #8801579 - Flags: review?(josh)

Attachment #8801579 - Flags: review?(honzab.moz)

Assignee

Comment 40

•

8 years ago

Attached patch Make sure to check mIPCClosed before calling SendRedirect1Begin — Details — Splinter Review

I only now noticed that we never check mIPCClosed before calling SendRedirect1Begin which could also cause this crash (even if there's a race or not). I would like to land this simple patch first, and see if it does anything to the crash numbers. As BillM said in comment 6, this is possibly caused by a race between child and parent, but I'd like to make sure before we uplift a pretty complicated patch.

Attachment #8802568 - Flags: review?(honzab.moz)

Assignee

Comment 41

•

8 years ago

Attached patch Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__() — Details — Splinter Review

* * *
[mq]: top

MozReview-Commit-ID: 4lb0gs32tOq

Attachment #8802569 - Flags: review?(josh)

Attachment #8802569 - Flags: review?(honzab.moz)

Assignee

Updated

•

8 years ago

Attachment #8801579 - Attachment is obsolete: true

Assignee

Comment 42

•

8 years ago

(In reply to Honza Bambas (:mayhemer) from comment #36)
> Instead of introducing a too general
> "async DeletingChannel()" I would add a one-purpose "async
> FinishInterceptedRedirect()" on *both* parent and child. 

I think we probably need a DeletingChannel as well. Otherwise, the child is doing releasing references, and the last one calls Send__delete__(), but the parent at the same time may Send another message which leads to this crash.

> true is that only referrer of the channel is at this point the ipc
> sub-channel.  but when a message comes from ipc (from the parent) that leads
> to exposition of this http child channel to someone that addrefs/releases it
> before ipc is actually closed, you will call SendDeletingChannel() again...
> 

This wasn't a problem because in RecvDeleteSelf we would indeed AddRef, but call Send__delete__() immediately. This sets mIPCOpen to false, so we don't end up calling SendDeletingChannel again. Actually, if we now have a RecvFinishInterceptingRedirect, we 

> 
> could it happen that (Recv)DeleteSelf() is being called from a different
> reason while the flag is still set?

I think it's possible. That's why I agree with your suggestion for a new message.
 
> > +    // XXX valentin: The interception logic should be rewritten to avoid
> > +    // doing AsyncOpen on the channel after the Send__delete__()
> 
> can you say why?  looks like this patch is trying to do it, isn't it?  or I
> don't understand this comment at all?

We still AsyncOpen the channel in FinishInterceptedRedirect, after calling Send__delete__(). Josh said he is addressing this in bug 1231222.

Comment 43

•

8 years ago

Comment on attachment 8802568 [details] [diff] [review]
Make sure to check mIPCClosed before calling SendRedirect1Begin

Review of attachment 8802568 [details] [diff] [review]:
-----------------------------------------------------------------

Nice!  Let's try.  Thanks.

Attachment #8802568 - Flags: review?(honzab.moz) → review+

BugBot [:suhaib / :marco/ :calixte]

Assignee

Comment 44

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/73ad49f175814ebeb16b8fb7b09b3217540332a9
Bug 1294719 - Make sure to check mIPCClosed before calling SendRedirect1Begin r=honzab

Comment 45

•

8 years ago

Crash volume for signature 'IPCError-content | (msgtype=0x8E0017,name=???) Route error: message sent to unknown actor ID':
 - nightly (version 52): 184 crashes from 2016-09-19.
 - aurora  (version 51): 563 crashes from 2016-09-19.
 - beta    (version 50): 1346 crashes from 2016-09-20.
 - release (version 49): 0 crashes from 2016-09-05.
 - esr     (version 45): 0 crashes from 2016-07-25.

Crash volume on the last weeks (Week N is from 10-17 to 10-23):
            W. N-1  W. N-2  W. N-3  W. N-4
 - nightly      22      16      79      66
 - aurora      172     137     184      32
 - beta        322     366     447     136
 - release       0       0       0       0
 - esr           0       0       0       0

Affected platforms: Windows, Mac OS X, Linux

Crash rank on the last 7 days:
             Browser Content     Plugin
 - nightly           #52
 - aurora            #6
 - beta              #5
 - release
 - esr

status-firefox51: --- → affected

status-firefox52: --- → affected

Phil Ringnalda (:philor)

Comment 46

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/73ad49f17581

Status: ASSIGNED → RESOLVED

Closed: 8 years ago

status-firefox52: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla52

Assignee

Updated

•

8 years ago

Status: RESOLVED → REOPENED

Keywords: leave-open

Resolution: FIXED → ---

Updated

•

8 years ago

status-firefox48: affected → wontfix

status-firefox49: affected → wontfix

status-firefox52: fixed → affected

Assignee

Comment 47

•

8 years ago

(In reply to Honza Bambas (:mayhemer) from comment #43)
> Comment on attachment 8802568 [details] [diff] [review]
> Make sure to check mIPCClosed before calling SendRedirect1Begin
> 
> Review of attachment 8802568 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Nice!  Let's try.  Thanks.

Didn't fix it :(
https://crash-stats.mozilla.com/report/index/a6ea2d48-fc79-4747-a53a-fe3322161021

Comment 48

•

8 years ago

This may be a bit late, but still worth:

(In reply to Valentin Gosu [:valentin] from comment #42)
> (In reply to Honza Bambas (:mayhemer) from comment #36)
> > Instead of introducing a too general
> > "async DeletingChannel()" I would add a one-purpose "async
> > FinishInterceptedRedirect()" on *both* parent and child. 
> 
> I think we probably need a DeletingChannel as well. Otherwise, the child is
> doing releasing references, and the last one calls Send__delete__(), but the
> parent at the same time may Send another message which leads to this crash.

Comment 6 is a good summary of what's happening here, yes, we may need it.  I also start thinking you should have FinishInterceptedRedirect message and DeletingChannel message as well.

Although, RecvFinishInterceptedRedirect will *not* close the IPC channel.  It will just do any necessary parent/child looping related to interception and redirects, nothing more.  To close ipc you would use explicitly DeletingChannel call.  So that you would have in your code something like (schematically):

  SendFinishInterceptedRedirect();
  SendDeletingChannel();

or a like...

DeletingChannel will be available on both sides (parent and child) and stand for sending a FIN packet in a TCP connection - by means "I'm done sending IPC to you, if you are as well, just delete the channel".  A bit more work, but when it proves itself, it could become a new standard of closing IPC making the fragile mIPCClosed flag obsolete.

More in detail, have two flags, mLocalIPCClosed and mRemoteIPCClosed (or better names).  mLocalIPCClosed means: I want my side of IPC close and will never send more messages (checked before each Send*()).  mRemoteIPCClosed means you got RecvDeletingChannel().  When *both* the flags are set, just do Send__delete__().  You need a method that examines these flags and call either DeletingChannel() or Send__delete__() based on mRemoteIPCClosed.  DeletingChannel and Send__Delete__ will only ever be called from that method!  On places you will want to close the IPC you will call this new IPC close wrapping method.

That's my few architectural cents.

> 
> > true is that only referrer of the channel is at this point the ipc
> > sub-channel.  but when a message comes from ipc (from the parent) that leads
> > to exposition of this http child channel to someone that addrefs/releases it
> > before ipc is actually closed, you will call SendDeletingChannel() again...
> > 
> 
> This wasn't a problem because in RecvDeleteSelf we would indeed AddRef, but
> call Send__delete__() immediately. This sets mIPCOpen to false, so we don't
> end up calling SendDeletingChannel again. Actually, if we now have a
> RecvFinishInterceptingRedirect, we 

Didn't finish the comment? :)  I'm not sure I follow anyway.  If you are trying to persuade me about a simpler solution, please do.  Just in a way I understand it :)

> We still AsyncOpen the channel in FinishInterceptedRedirect, after calling
> Send__delete__(). Josh said he is addressing this in bug 1231222.

Aha.  Got it.

Kan-Ru Chen [:kanru] (UTC+9)

Assignee

Comment 49

•

8 years ago

I think you are completely right. However, this patch will probably have to be uplifted, and I really don't want to make too many changes. I'm aiming for the smallest amount of code change that doesn't change behaviour too much.
Do you think this patch is OK? I'd love to push this to m-c, to make sure it really fixes our problem.

(In reply to Honza Bambas (:mayhemer) from comment #48)
> Although, RecvFinishInterceptedRedirect will *not* close the IPC channel.

It does close it. Send__delete__() definitely closes. We reopen it by calling AsyncOpen.

> > This wasn't a problem because in RecvDeleteSelf we would indeed AddRef, but
> > call Send__delete__() immediately. This sets mIPCOpen to false, so we don't
> > end up calling SendDeletingChannel again. Actually, if we now have a
> > RecvFinishInterceptingRedirect, we 

I think I meant to say that RecvDeleteSelf doesn't actually need to keep a ref to the channel, since it doesn't call any other methods anymore.
RecvFinishInterceptingRedirect does need to keep a ref, but that's OK. The ref count will be 1 entering the call. When we addref it will be 2. Then we call Send__delete__() which leads to mIPCOpen=false, and the refcount goes to 1. Then we call AsyncOpen on the object, which calls AddIPDLReference and reopens the IPDL channel.

Flags: needinfo?(honzab.moz)

Updated

•

8 years ago

Crash Signature: ,name=???) Route error: message sent to unknown actor ID] [@ IPCError-content | (msgtype=0x8E0017,name=???) Route error: message sent to unknown actor ID] → ,name=???) Route error: message sent to unknown actor ID] [@ IPCError-content | (msgtype=0x8E0017,name=???) Route error: message sent to unknown actor ID] [@ IPCError-content | PHttpChannel::Msg_Redirect1Begin Route error: message sent to unknown actor …

Comment 51

•

8 years ago

Comment on attachment 8802569 [details] [diff] [review]
Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__()

Review of attachment 8802569 [details] [diff] [review]:
-----------------------------------------------------------------

locally tested with dom/workers/test/serviceworkers/test_openWindow.html.  code exercised, no crashes, no leaks.  but if this fixes the original reported crash, that's hard to say.

there are just few cosmetic changes to do, then land please.

::: netwerk/protocol/http/HttpChannelChild.cpp
@@ +1155,5 @@
>  
> +bool
> +HttpChannelChild::RecvFinishInterceptedRedirect()
> +{
> +  // Hold a ref to this to keep it from being deleted by Send__delete__()

here it makes sense to keep the ref...

@@ +1170,5 @@
>  void
>  HttpChannelChild::DeleteSelf()
>  {
> +  // Hold a ref to this to keep it from being deleted by Send__delete__()
> +  RefPtr<HttpChannelChild> self(this);

...but here it doesn't.  it will still delete before this method returns.  I removed this locally and it didn't crash.  if there is no reason (I don't see it, nor comment suggest any) please remove it.

@@ +1175,4 @@
>    Send__delete__(this);
>  }
>  
> +class OverrideRunnable : public Runnable {

should be in an anon namespace, but probably better would be declare it as a nested class of HttpChannelChild

@@ +1195,5 @@
> +  , mHead(aHead)
> +  {
> +  }
> +
> +  NS_IMETHOD Run() override {

{ on a new line

@@ +1234,5 @@
> +  }
> +
> +  if (mOverrideRunnable) {
> +    RefPtr<OverrideRunnable> override =
> +      static_cast<OverrideRunnable*>(mOverrideRunnable.get());

if OverrideRunnable is nested to httpchannelchild, you may keep a concrete type refptr.

@@ +1529,5 @@
>    LOG(("HttpChannelChild::Redirect3Complete [this=%p]\n", this));
>    nsresult rv = NS_OK;
>  
> +  nsCOMPtr<nsIHttpChannelChild> chan = do_QueryInterface(mRedirectChannelChild);
> +  RefPtr<HttpChannelChild> httpChannelChild = static_cast<HttpChannelChild*>(chan.get());

grrr... is there no other way than static cast?

Attachment #8802569 - Flags: review?(honzab.moz) → review+

Updated

•

8 years ago

Flags: needinfo?(honzab.moz)

Assignee

Comment 53

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/de41e684fb55d7e64a3eecf8ee0e4279f37e0724
Bug 1294719 - Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__() r=honzab

Andreas Tolfsen ❲:ato❳

Comment 54

•

8 years ago

As a matter of reproduction, I quite often experience this on Linux x86_64 loading guardian.co.uk.

Comment 55

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/de41e684fb55

Josh Matthews [:jdm]

Comment 56

•

8 years ago

Is my review here still required since this merged?

Assignee

Comment 57

•

8 years ago

(In reply to Josh Matthews [:jdm] from comment #56)
> Is my review here still required since this merged?
Not really. :) Your initial review was very helpful. But let me know if you saw something that seems wrong in the patch.

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Assignee

Comment 58

•

8 years ago

Comment on attachment 8802569 [details] [diff] [review]
Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__()

Approval Request Comment
[Feature/regressing bug #]: Preexisting bug in e10s implementation.
[User impact if declined]: A race condition may cause crashes when the parent process sends a message after the child has tried to close the IPC channel. Most commonly happens with redirects.
[Describe test coverage new/current, TreeHerder]: We have unit tests checking that redirects and service workers still work properly.
[Risks and why]: This changes the way we shut down the IPC channel. We try to keep behaviour just the same. The risk is low, but not zero.
[String/UUID change made/needed]: None.

Attachment #8802569 - Flags: review?(josh)

Attachment #8802569 - Flags: approval-mozilla-beta?

Attachment #8802569 - Flags: approval-mozilla-aurora?

Comment 59

•

8 years ago

Comment on attachment 8802569 [details] [diff] [review]
Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__()

Let's stabilize this change on Aurora51 and then decide whether it's sufficiently low risk to include in a 50 RC build or not.

Attachment #8802569 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 60

•

8 years ago

Comment on attachment 8802569 [details] [diff] [review]
Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__()

The data from Nightly seems promising on this one. I am taking a big leap of faith on this one, taking it in 50 RC1. If things look worse, we will plan a backout. Aurora51+, Beta50+

Attachment #8802569 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 61

•

8 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-aurora/rev/058c4a1cd78b

status-firefox51: affected → fixed

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 62

•

8 years ago

This is hitting conflicts applying to beta.

Flags: needinfo?(valentin.gosu)

Assignee

Comment 63

•

8 years ago

Attached patch [beta] Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__() — Details — Splinter Review

MozReview-Commit-ID: 4lb0gs32tOq

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Assignee

Comment 64

•

8 years ago

Rebased on beta.

Flags: needinfo?(valentin.gosu) → needinfo?(wkocher)

Comment 65

•

8 years ago

Hi Jim, Erin, going back on the history of this bug, we decided to disable e10s on RU locale to work around this problem. Is e10s going to be disabled for RU users in 50 as well? In that case, I would prefer not to uplift this fix to Beta50. The crash rates on 50 are finally looking good and I'd hate to disrupt that.

Flags: needinfo?(jmathies)

Flags: needinfo?(elancaster)

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 66

•

8 years ago

Comment on attachment 8802569 [details] [diff] [review]
Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__()

Hi Wes, Tomcat, please hold off on uplifting this to Beta50. I am a bit unclear on the severity of the issue and feel like it might be too late to take this in 50.

Attachment #8802569 - Flags: approval-mozilla-beta+ → approval-mozilla-beta?

Comment 67

•

8 years ago

Disabling e10s for RU locales made it better, but it's still the #2 top content crasher on 49: https://crash-stats.mozilla.com/topcrashers/?product=Firefox&version=49.0.2&days=7&process_type=content.

In 50 it's a bit better, it is the #6: https://crash-stats.mozilla.com/topcrashers/?product=Firefox&version=50.0b&days=7&process_type=content.

Reporter

Comment 68

•

8 years ago

(In reply to Ritu Kothari (:ritu) from comment #65)
> Hi Jim, Erin, going back on the history of this bug, we decided to disable
> e10s on RU locale to work around this problem. Is e10s going to be disabled
> for RU users in 50 as well? In that case, I would prefer not to uplift this
> fix to Beta50. The crash rates on 50 are finally looking good and I'd hate
> to disrupt that.

We want e10s on for russian users before 51 lands in release so we can either do that through an uplift now (necko engineers should comment on how comfortable they are with a late 50 uplift) or through an eventual 50 point release.

Flags: needinfo?(jmathies)

Erin Lancaster [:elan]

Updated

•

8 years ago

Flags: needinfo?(elancaster)

Assignee

Comment 69

•

8 years ago

Of course I'd be more comfortable with doing a point release, if it has a few weeks to bake on beta, but a late uplift shouldn't be too risky. Also, we should probably coordinate it with re-enabling e10s for Russian users.

Updated

•

8 years ago

status-firefox52: affected → fixed

Assignee

Comment 70

•

8 years ago

Based on nightly crash reports, we can safely say this bug is fixed.

Status: REOPENED → RESOLVED

Closed: 8 years ago → 8 years ago

Resolution: --- → FIXED

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Updated

•

8 years ago

Keywords: leave-open

Comment 71

•

8 years ago

Thanks Jim and Valentin. Let's not uplift this in 50 but rather in the planned 50.1.0 release. By then this fix will have baked on Aurora51/Beta51 for a few weeks.

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Updated

•

8 years ago

Flags: needinfo?(wkocher)

Reporter

Updated

•

8 years ago

Attachment #8802569 - Flags: approval-mozilla-beta?

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Reporter

Comment 72

•

8 years ago

Comment on attachment 8802569 [details] [diff] [review]
Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__()

Flagging for potential inclusion into the december point release.

Approval Request Comment
[Feature/regressing bug #]:
regression caused by e10s / necko code
[User impact if declined]:
e10s is disabled for russian users due to this, we want to get those users turned back on.
[Describe test coverage new/current, TreeHerder]:
patch landed on nightly/aurora.
[Risks and why]: 
medium, invasive necko patch, but pretty well tested in current releases that have the fix.
[String/UUID change made/needed]:
none

Attachment #8802569 - Flags: approval-mozilla-release?

Comment 74

•

8 years ago

Comment on attachment 8802569 [details] [diff] [review]
Make sure HttpChannelParent stops sending IPC messages before calling Send__delete__()

This was a top crasher in 49,50. It has stabilized for a few weeks on pre-release channels, let's include it in 50.1.0

Attachment #8802569 - Flags: approval-mozilla-release? → approval-mozilla-release+

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 75

•

8 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-release/rev/533c2e8ef2ab
https://hg.mozilla.org/releases/mozilla-release/rev/1c4dcc66d27a

status-firefox50: affected → fixed

Comment 76

•

7 years ago

Hi Valentin, Jim, this fix hasn't worked well on the release channel it seems. I see an increase in crash rates on 50.1.0 release and Philipp mentioned that this bug might be a potential reason. Can you please investigate? Hoping that by the time Fx51 will go-live mid-Jan we will have an updated fix.

Flags: needinfo?(valentin.gosu)

Flags: needinfo?(jmathies)

[:philipp]

Comment 77

•

7 years ago

sorry for being too ambiguous - the signature is spiking up http://bit.ly/2h2NYFU most probably due to some third-party website change (ad networks?), but only on 50.0/50.0.1/50.0.2 - so the fix we included in 50.1 seems to do it's job!

Flags: needinfo?(valentin.gosu)

Flags: needinfo?(jmathies)

Updated

•

4 years ago

Regressions: 1651661