Closed Bug 1289001 Opened 3 years ago Closed 3 years ago

Crash in IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b

Categories

(Core :: Networking: HTTP, defect, P1, critical)

Unspecified
All
defect

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
firefox50 + wontfix
firefox51 + fixed
firefox52 --- fixed

People

(Reporter: njn, Assigned: baku)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, Whiteboard: [necko-active][OA])

Crash Data

Attachments

(5 files, 6 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-1f2056b1-5998-465e-bdc3-46ed22160725.
=============================================================

New crash, first seen in Nightly 20160722030235 and it is the #9 topcrash in that build. It has occurred 36 times across multiple installations, on both Windows and Mac.

In all the crash reports thread 0 is reported as the crashing thread, but the stack trace varies widely across the different crash report and so appears to be unrelated. The processor notes field says 'Signature replaced with an IPC Channel Error, was: "<func>"'.

The full abort message is "Value error: message was deserialized, but contained an illegal value".

I'm not sure how to interpret this one. mrbkap, any ideas from the IPC side? Valentin, any ideas from the Necko side?
Flags: needinfo?(valentin.gosu)
Flags: needinfo?(mrbkap)
I'm not entirely sure, but I believe this means that one of the parameters (from the child to the parent) failed to unwrap during the IPC call. I glanced through checkins to netwerk/ from July 21-today and didn't see anything obvious, but I didn't look too closely so I could have missed something. This could have also meant that AllocPHttpChannelParent failed, but I'm not sure why that would be (one thing that can cause it to fail is if we're trying to create an HTTP channel from a content process that doesn't have a tab parent, is anybody trying that yet?).
Flags: needinfo?(mrbkap)
OS: Windows 10 → All
(In reply to Ludovic Hirlimann [:Usul] from comment #2)
> url to reproduce :
> http://c.ymcdn.com/sites/hmgs.site-ym.com/resource/resmgr/Historicon/
> HC16_PEL_05_13.pdf

I tried and failed to reproduce on three different machines (Windows, Mac, Linux) all running Nightly.

Were there any specific steps to reproduce with this URL? I just scrolled slowly through the PDF to the bottom.
Flags: needinfo?(ludovic)
(In reply to Nicholas Nethercote [:njn] from comment #3)
> (In reply to Ludovic Hirlimann [:Usul] from comment #2)
> > url to reproduce :
> > http://c.ymcdn.com/sites/hmgs.site-ym.com/resource/resmgr/Historicon/
> > HC16_PEL_05_13.pdf
> 
> I tried and failed to reproduce on three different machines (Windows, Mac,
> Linux) all running Nightly.
> 
> Were there any specific steps to reproduce with this URL? I just scrolled
> slowly through the PDF to the bottom.

current nightly doesn't repro. :(
Flags: needinfo?(ludovic)
Whiteboard: [necko-next]
Hi Ludovic, can you try to repro using Aurora50? This is a top content crasher on the Aurora channel/
Flags: needinfo?(ludovic)
(In reply to Ritu Kothari (:ritu) from comment #5)
> Hi Ludovic, can you try to repro using Aurora50? This is a top content
> crasher on the Aurora channel/

I was just loading the url from comment 2 when I was crashing for sure. this doesn't work anymore :(
Flags: needinfo?(ludovic)
Crash volume for signature 'IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b':
 - nightly (version 51): 704 crashes from 2016-08-01.
 - aurora  (version 50): 2103 crashes from 2016-08-01.
 - beta    (version 49): 0 crashes from 2016-08-02.
 - release (version 48): 0 crashes from 2016-07-25.
 - esr     (version 45): 0 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
 - nightly     232     178     136
 - aurora      865     609     174
 - beta          0       0       0
 - release       0       0       0
 - esr           0       0       0

Affected platforms: Windows, Mac OS X, Linux

Crash rank on the last 7 days:
             Browser Content     Plugin
 - nightly           #5
 - aurora            #3
 - beta
 - release
 - esr
Hi Bill, is this content crash similar or same as the browser crash you fixed in bug 1277582?
Flags: needinfo?(wmccloskey)
STR:

 1. Open latest aurora.
 2. Set the pref "privacy.userContext.enabled" and "privacy.userContext.ui.enabled" set on true.
 3. Restart the browser .
 4. Press alt and from File->New Container Tab->Work.
 5. Open the console and only activate .Net>XHR.
 6. Go to www.cnn.com and click a news.

Notes: 

a. The machines had Intel Graphics.
b. e10s was activated.
c. This occured on 3 differite machines, 2 on Windows 10 and one Windows 7.
d. This happens with all the container tabs.

Crash reports:

https://crash-stats.mozilla.com/report/index/76b9b756-c5b8-490c-8cbb-3e2132160915
https://crash-stats.mozilla.com/report/index/432d70a5-c0f3-4d86-99e0-677052160915
https://crash-stats.mozilla.com/report/index/98ff4e8e-c80f-4fad-91e3-5f7312160915
I've just hit this crash on Linux, when I closed a Google docs page which was in a separate container: https://crash-stats.mozilla.com/report/index/11a476a5-a067-49bc-b119-485682160919.
From what I can tell deserialization of SerializedLoadContext fails because return
> aResult->mOriginAttributes.PopulateFromSuffix(suffix);
returns false.
This bug was filed right after bug 1295103 landed.

I'm not exactly sure why the call fails. Kan-Ru, do you have time to take a look at this?
Flags: needinfo?(valentin.gosu) → needinfo?(kchen)
Were you able to reproduce this Valentin? I wasn't. Based on what I can figure out, it seems like AllocPHttpChannelParent might be returning null. That's the only place I can find where we return MsgValueError without first calling FatalError (which will crash the parent).

However, your comment suggests that one of the Read functions on the arguments is failing. If so, I wonder why we're not crashing first in FatalError?
Flags: needinfo?(wmccloskey) → needinfo?(valentin.gosu)
(In reply to Bill McCloskey (:billm) from comment #13)
> Were you able to reproduce this Valentin? I wasn't. 

No, I haven't. I was basing my analysis on the crash reports.

> Based on what I can
> figure out, it seems like AllocPHttpChannelParent might be returning null.
> That's the only place I can find where we return MsgValueError without first
> calling FatalError (which will crash the parent).
> 
> However, your comment suggests that one of the Read functions on the
> arguments is failing. If so, I wonder why we're not crashing first in
> FatalError?

You're right, if it were the deserialization it should have called FatalError.
It is however possible for AllocPHttpChannelParent to return nullptr if NeckoParent::CreateChannelLoadContext returns an error string because in NeckoParent::GetValidatedAppInfo aSerialized is "null".
That means the problem would be on the child side - so in SerializedLoadContext(nsIChannel*), NS_QueryNotificationCallbacks turns up a null load context. I'm not sure why that is, or what we should do in that case.

Jason, any idea what the behaviour is supposed to be in this case?
Flags: needinfo?(valentin.gosu) → needinfo?(jduell.mcbugs)
So for B2G apps we were quite strict--if a channel from the child didn't have a LoadContext (SerializedLoadContext, to be precise), we killed it the NeckoParent security checks.  That logic is still there.

So this sounds like a programming error in the child somewhere.  Something is creating a channel without giving it a LoadInfo in its callbacks.
Flags: needinfo?(jduell.mcbugs)
Flags: needinfo?(kchen)
[Tracking Requested - why for this release]:
this signature is regressing since firefox 50 builds and is currently the #1 crasher in the content process with 12% of all crashes in 50.0b1 crash data.
Keywords: regression
Duplicate of this bug: 1284249
(In reply to Jason Duell [:jduell] (needinfo me) from comment #15)
> So for B2G apps we were quite strict--if a channel from the child didn't
> have a LoadContext (SerializedLoadContext, to be precise), we killed it the
> NeckoParent security checks.  That logic is still there.
> 
> So this sounds like a programming error in the child somewhere.  Something
> is creating a channel without giving it a LoadInfo in its callbacks.

Ehsan, is this still the case? It seems to me that in the SerializedLoadContext constructor, we still handle the case when !loadContext. Should we change the HttpChannelParent constructor so that it allows a null loadContext?
Thanks!
comment 18
Flags: needinfo?(ehsan)
(In reply to Valentin Gosu [:valentin] from comment #18)
> (In reply to Jason Duell [:jduell] (needinfo me) from comment #15)
> > So for B2G apps we were quite strict--if a channel from the child didn't
> > have a LoadContext (SerializedLoadContext, to be precise), we killed it the
> > NeckoParent security checks.  That logic is still there.
> > 
> > So this sounds like a programming error in the child somewhere.  Something
> > is creating a channel without giving it a LoadInfo in its callbacks.
> 
> Ehsan, is this still the case?

If you mean, do we still have the code to kill the child when invalid stuff is found in Necko IPC, then the answer is yes.  This is controlled by the network.disable.ipc.security pref, which is true on desktop (which means, the said security checks are bypassed.)

I think this is completely irrelevant to this bug though, FWIW.

> It seems to me that in the
> SerializedLoadContext constructor, we still handle the case when
> !loadContext. Should we change the HttpChannelParent constructor so that it
> allows a null loadContext?
> Thanks!

I don't understand what you mean here.  A SerializedLoadContext can be created from an nsILoadContext* or from other things, including nullptr, where we won't have a load context pointer, but that's fine.  That's why the branch handling a null load context exists.



Now, let me give a shot at debugging this.  :-)

Comment 10 says that this happens with the containers turned on.  Containers are implemented by the userContextId field on OriginAttributes.  In comment 12, you said that this is happening because PopulateFromSuffix() is returning false, which makes sense if the bug has something to do with containers.  Note that there are a couple of cases where parsing "userContextId" would fail, for example if its value isn't an integer or is above the uint32_t range.  Now, there is some OA manipulation happening inside NeckoParent::GetValidatedAppInfo(), but the thing that looks suspect is this code: <http://searchfox.org/mozilla-central/source/netwerk/ipc/NeckoParent.cpp#110,143>  Note how we're trying to match the userContextIds, and if we get a mismatch, we'll end up here <http://searchfox.org/mozilla-central/source/netwerk/ipc/NeckoParent.cpp#157> returning an error which would crash with the symptoms you're seeing here.  I bet this is the bug.

Now the $1,000,000 question is: how do we end up with a mismatching user context ID.  Obviously if we managed to reproduce this, we can easily debug what's happening.  But assuming that we can't (I'm assuming you've been testing with loading stuff in tabs from different containers, which is how you'd trigger the code in question...), we need to come up with theories on what could cause this.  For example, could we end up in a situation where the TabContext's notion of its OriginAttributes is out of date with the child process?

I'm CCing baku and tanvi who are familiar with the container feature.
Flags: needinfo?(ehsan)
Tracking this top crasher as a blocking issue for Fx50.
Hi Baku, Tanvi, this is a top crasher on Beta50. Ehsan has some questions in comment 20. I hope you can help root cause and we can have a potential fix ready soon. Thanks!
Flags: needinfo?(tanvi)
Flags: needinfo?(amarchesini)
(In reply to Ritu Kothari (:ritu) from comment #22)
> Hi Baku, Tanvi, this is a top crasher on Beta50. Ehsan has some questions in
> comment 20. I hope you can help root cause and we can have a potential fix
> ready soon. Thanks!

My questions were actually directed at Valentin.  Needinfoing him instead.
Flags: needinfo?(valentin.gosu)
Flags: needinfo?(tanvi)
Flags: needinfo?(amarchesini)
mozilla::ipc::ProcessLink::SendMessageW on 49 is probably due to the same bug.
I see the following correlations:
(100.0% in signature vs 00.09% overall) moz_crash_reason = MOZ_CRASH(IPC message size is too large)
(100.0% in signature vs 39.88% overall) reason = EXCEPTION_BREAKPOINT
(42.94% in signature vs 00.03% overall) ipc_message_name = PContent::Reply_ReadPrefsArray
(40.59% in signature vs 00.03% overall) ipc_message_name = PNecko::Msg_PHttpChannelConstructor
(42.94% in signature vs 09.30% overall) useragent_locale = ru
(30.59% in signature vs 02.55% overall) Addon "McAfee SiteAdvisor" = true
(38.24% in signature vs 00.03% overall) ipc_message_name = PNecko::Msg_PHttpChannelConstructor ∧ useragent_locale = ru

The 40% of the reports that have (ipc_message_name = PNecko::Msg_PHttpChannelConstructor ∧ useragent_locale = ru) are really similar to this bug.

I will file a new bug for the other ipc_message_name.

(In 50 and up, that signature is related to bug 1288997)
Crash Signature: [@ IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b] → [@ IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b] [@ mozilla::ipc::ProcessLink::SendMessageW ]
See Also: → 1306331
So, looks like 49 is affected, but with a smaller volume and a different signature (unless it is a different crash).

The McAfee SiteAdvisor addon is correlated with the crash with the other ipc_message_name ((30.59% in signature vs 00.02% overall) Addon "McAfee SiteAdvisor" = true ∧ ipc_message_name = PContent::Reply_ReadPrefsArray).
Actually, the crash caused by a too large IPC message with PNecko::Msg_PHttpChannelConstructor is tracked in bug 1277681.
Crash Signature: [@ IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b] [@ mozilla::ipc::ProcessLink::SendMessageW ] → [@ IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b]
See Also: → 1277681
Crash volume for signature 'IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b':
 - nightly (version 52): 364 crashes from 2016-09-19.
 - aurora  (version 51): 660 crashes from 2016-09-19.
 - beta    (version 50): 4279 crashes from 2016-09-20.
 - release (version 49): 0 crashes from 2016-09-05.
 - esr     (version 45): 0 crashes from 2016-06-01.

Crash volume on the last weeks (Week N is from 10-03 to 10-09):
            W. N-1  W. N-2
 - nightly     223     141
 - aurora      563      97
 - beta       3398     881
 - release       0       0
 - esr           0       0

Affected platforms: Windows, Mac OS X, Linux

Crash rank on the last 7 days:
             Browser Content     Plugin
 - nightly           #7
 - aurora            #4
 - beta              #1
 - release
 - esr
Assignee: nobody → valentin.gosu
Flags: needinfo?(valentin.gosu)
Whiteboard: [necko-next] → [necko-active]
Since we weren't able to reproduce this issue with the STR, I want push a diagnostic patch first, to figure out what's going on.
Attachment #8798043 - Flags: review?(jduell.mcbugs)
Attachment #8798043 - Flags: review?(jduell.mcbugs) → review+
https://hg.mozilla.org/mozilla-central/rev/871153c1b8b2
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla52 → ---
Comment on attachment 8798043 [details] [diff] [review]
bug1289001-diagnose.patch

Review of attachment 8798043 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/ipc/NeckoParent.cpp
@@ +113,5 @@
> +#ifndef RELEASE_BUILD
> +  MOZ_CRASH_ANNOTATE(reason);
> +  MOZ_REALLY_CRASH();
> +#endif
> +}

FWIW, we have MOZ_DIAGNOSTIC_ASSERT for this exact purpose.  You didn't need to roll your own here.  :-)
Duplicate of this bug: 1308314
Crash Signature: [@ IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b] → [@ IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b] [@ mozilla::net::CrashWithReason ]
So the debug code has some interesting results.

The following pairs represent us trying to match the first field aSerialized.mOriginAttributes.mUserContextId to the second field tabContext.OriginAttributesRef().mUserContextId:

https://crash-stats.mozilla.com/report/index/533e3552-3d5b-4c60-a8fb-905c62161007
-(5,0)(5,0)

https://crash-stats.mozilla.com/report/index/509cac1f-8555-47d6-873d-ec4d22161006
-(0,2)(0,2)(0,2)(0,2)(0,2)

It seems that we're both trying to trying to create a channel with a non-zero userContextId in a tab with a zero ContextId, and vice-versa.
Crash Signature: [@ IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b] [@ mozilla::net::CrashWithReason ] → [@ IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b] [@ mozilla::net::CrashWithReason ] [@ mozilla::net::NeckoParent::GetValidatedAppInfo ]
FYI - I got crash in [@ mozilla::net::CrashWithReason ] on 52.0a1 [e10s enabled] after closing page which was still loading images

Crashlog report - https://crash-stats.mozilla.com/report/index/0580288c-4f4a-4368-ab37-7801b2161008
(In reply to Virtual_ManPL [:Virtual] - (ni? me) from comment #35)
> FYI - I got crash in [@ mozilla::net::CrashWithReason ] on 52.0a1 [e10s
> enabled] after closing page which was still loading images
> Crashlog report -
> https://crash-stats.mozilla.com/report/index/0580288c-4f4a-4368-ab37-7801b2161008

Thanks for the report. It seems we don't only crash for container tabs.
GetValidatedAppInfo | ContentParent does not have any PBrowsers
I'm just trying to wrap my head around this. It seems most of the crashes occur for (0,5) or (5,0) pairs. What do we use contextId 5 for? Tanvi, do you know if this is for an addon or something in our code?

There are a few for contextIDs 1,2,3,4 as well, but fewer. I assume might be some kind of race between opening channels and closing windows with a certain TabContext.
Flags: needinfo?(wmccloskey)
Flags: needinfo?(tanvi)
Which would suggest this is a regression from bug 1279568.  At least now we have something to back out!
Blocks: 1279568
I know what userContextId == 5 is: this is a private container used to generate thumbnails for about:newtab.
At the beginning the idea was to use privateBrowsing to load this thumbnail for security reasons, but it's not currently possible to have privateBrowserId > 1.
I still think we have an IPC problem here. (0,2) pairs have also crashed the browser.
For now, we have 0 reports on Beta 50.0b6. I wonder if the signature has changed.
Backing out 1279568 will solve the (0,5) mismatches.  We should back that out of Beta now.  But I don't know why we'd be seeing (0,1-4) mismatches.  Containers isn't enabled on Beta.  So users would have to explicitly go into about:config and enable them.

Test Pilot for Containers will probably launch in Firefox 50 though.  In which case, this could become a Test Pilot crash issue.

We need to figure out why we have the mismatch in the first place.
For Bug 1279568, instead of backing out, maybe we just need to make it conditional on the privacy.usercontext.enabled pref.
Filed bug 1309699 to track the back out mentioned in comment 39 and 44.
Depends on: 1309699
Is the crash only an issue for debug builds?  Talking to baku earlier, it sounded like it was an assertion failure that would only effect debug builds.  But given the number of crashes, it must be a non-debug issue too.
(In reply to Tanvi Vyas [:tanvi] from comment #46)
> Is the crash only an issue for debug builds?  Talking to baku earlier, it
> sounded like it was an assertion failure that would only effect debug
> builds.  But given the number of crashes, it must be a non-debug issue too.

No, this was a MOZ_CRASH which turned into a manual crash in Valentin's debugging patch.  See <http://searchfox.org/mozilla-central/source/netwerk/ipc/NeckoParent.cpp#185>
This manual assertion was introduced by this patch. Do we want to keep it or it's just for debugging reasons?
(In reply to Andrea Marchesini [:baku] from comment #48)
> This manual assertion was introduced by this patch. Do we want to keep it or
> it's just for debugging reasons?

No, it was there before.

Before the debug patch, we'd return the string here: <https://hg.mozilla.org/mozilla-central/rev/871153c1b8b2#l1.91> which would have been returned from <http://searchfox.org/mozilla-central/rev/2142de26c16c05f23e543be4fa1a651c4d29604e/netwerk/ipc/NeckoParent.cpp#212> which would cause us to return null from the AllocPHttpChannelParent function <http://searchfox.org/mozilla-central/rev/2142de26c16c05f23e543be4fa1a651c4d29604e/netwerk/ipc/NeckoParent.cpp#263> which would cause us to return MsgValueError from NeckoParent::OnMessageReceived() which would flow through the IPC machinery to get to mozilla::ipc::FatalError <http://searchfox.org/mozilla-central/rev/d96317a351af8aa78ab9847e7feed964bbaac7d7/ipc/glue/ProtocolUtils.cpp#292> which would MOZ_CRASH.
Ritu, did 1309699 reduce the number of beta crashers?

Valentin or baku, any progress on finding the root cause?
Flags: needinfo?(rkothari)
(In reply to Tanvi Vyas [:tanvi] from comment #50)
> Ritu, did 1309699 reduce the number of beta crashers?

curiously the content crashes with this signature already ceased in 50.0b6, whereas the backout from bug 1309699 should have just happened in beta 7.
https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_50_0b5_RELEASE&tochange=FIREFOX_50_0b6_RELEASE
Flags: needinfo?(rkothari)
Duplicate of this bug: 1307028
(In reply to [:philipp] from comment #51)
> (In reply to Tanvi Vyas [:tanvi] from comment #50)
> > Ritu, did 1309699 reduce the number of beta crashers?
> 
> curiously the content crashes with this signature already ceased in 50.0b6,
> whereas the backout from bug 1309699 should have just happened in beta 7.
> https://hg.mozilla.org/releases/mozilla-beta/
> pushloghtml?fromchange=FIREFOX_50_0b5_RELEASE&tochange=FIREFOX_50_0b6_RELEASE

I think some IPDL changes made the signature change.
Crash Signature: [@ IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b] [@ mozilla::net::CrashWithReason ] [@ mozilla::net::NeckoParent::GetValidatedAppInfo ] → [@ IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b] [@ IPCError-browser | (msgtype=0xB00005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b] [@ mo…
(In reply to Valentin Gosu [:valentin] from comment #53)
> (In reply to [:philipp] from comment #51)
> > (In reply to Tanvi Vyas [:tanvi] from comment #50)
> > > Ritu, did 1309699 reduce the number of beta crashers?
> > 
> > curiously the content crashes with this signature already ceased in 50.0b6,
> > whereas the backout from bug 1309699 should have just happened in beta 7.
> > https://hg.mozilla.org/releases/mozilla-beta/
> > pushloghtml?fromchange=FIREFOX_50_0b5_RELEASE&tochange=FIREFOX_50_0b6_RELEASE
> 
> I think some IPDL changes made the signature change.

Actually, I don't see any more crash reports for the other signature either. If it wasn't the backout that did it, it might have been bug 1277681. Or maybe the signature has changed in another odd way.
The root cause of this bug has still not been identified.  Backing out about:newtab Cookie segregation in https://bugzilla.mozilla.org/show_bug.cgi?id=1309699 helped reduce the number of crashes we see, but the underlying problem still exists.

We don't have a reproducible test case, which makes finding the issue very difficult.  From what I understand, it has to do with a race condition when opening or closing a tab.

If we don't fix this in Firefox 50 Beta, the Test Pilot experiment for Containers (which will go into release 50 at the end of Q4) is at risk.  This is becasue the experiment will enable Containers and hence make this race condition more prevalent again (the way about:newtab did).
Priority: -- → P1
Whiteboard: [necko-active] → [necko-active][OA]
I've just noticed that I can reproduce this bug pretty consistently (the 'GetValidatedAppInfo | App does not have permission -(0,2)(0,2)' case) by simply opening a PDF (pdf.js) in a container tab.
See Also: → 1305380
(In reply to Valentin Gosu [:valentin] from comment #54)
Note that even though there are no more crashes on beta, Nightly is still hitting the debug code.
(In reply to Marco Castelluccio [:marco] from comment #56)
> I've just noticed that I can reproduce this bug pretty consistently (the
> 'GetValidatedAppInfo | App does not have permission -(0,2)(0,2)' case) by
> simply opening a PDF (pdf.js) in a container tab.

Can you specify the STR? I wasn't able to reproduce the crash with any of the PDF files I tried, on either Linux or Windows.
Flags: needinfo?(mcastelluccio)
I can't reproduce it as reliably as yesterday, not sure why.
I was able to reproduce by simply searching for a PDF on Google and opening it in another tab with a middle click.
Flags: needinfo?(mcastelluccio)
(In reply to Marco Castelluccio [:marco] from comment #59)
> I can't reproduce it as reliably as yesterday, not sure why.
> I was able to reproduce by simply searching for a PDF on Google and opening
> it in another tab with a middle click.

Exactly what I experienced with comment 2. Maybe some cache issues ?
I've managed to reproduce this crash pretty consistently (6+ times)!! Hopefully this is what we're looking for... I've added as much information as possible.

Process 45705 stopped
* thread #1: tid = 0x471ccd, 0x0000000101e0fd2e XUL`mozilla::net::CrashWithReason(reason="GetValidatedAppInfo | App does not have permission -(0,5)") + 25 at NeckoParent.cpp:108, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x0000000101e0fd2e XUL`mozilla::net::CrashWithReason(reason="GetValidatedAppInfo | App does not have permission -(0,5)") + 25 at NeckoParent.cpp:108
   105 	{
   106 	#ifndef RELEASE_OR_BETA
   107 	  MOZ_CRASH_ANNOTATE(reason);
-> 108 	  MOZ_REALLY_CRASH();
   109 	#endif
   110 	}
   111 	

Example of the STR being used to reproduce the crash:
* https://youtu.be/KDSdHoUm6Vk (attached a crash stack as well)

Perquisites:

* You'll need to create a slack team via https://slack.com/create (it should only take a few seconds to get one up and running)
* upload several PDF's files into one of the slack channels (I used #general in my testing)

STR: (see the above video for an example)

* open the latest version of m-c using ./mach run --debug
* login into the slack team that you've created earlier in a container (I used the personal container)
* click on one of the PDF's that you've uploaded, while the PDF is opening, quickly open a new tab via the keyboard shortcut and close both the PDF that's being loaded and the new tab that you've just created (this needs to be done very quickly)
* repeating the steps will eventually cause the above crash (sometimes it's instant and sometimes it takes 10-20 times)

Once you get the above crash using the above STR, you'll be able to reproduce the crash once again via:

* once the crash occurs, close m-c via the terminal (lldb exit)
* re-open the same browser via ./mach run --debug
* open a new tab and close about:sessionrestore (don't restore)
* leave the new tab opened for a minute or so and the browser will crash

I'm assuming it's trying to load a thumbnail from the previous session which causes the crash without the user doing anything. This might explain why the users are seeing the crash happen a few times in a row and then not being able to reproduce.
Attached file stack.txt
I think I know what the problem is. Here what I have found so far:

1. When we load a PDF, PDF.js does the main http request using the originAttribute of the current window/tab. This means that, if the tab is a container tab, OriginAttributes will be correctly set in the HTTP request.

2. But PDFs are complex things and sometimes, also based on what the server supports, we need to do extra requests. Those are called "HTTP range requests". For how PDF.js is written, those additional requests are done as XHR, in a sandbox, directly by the pdfjs.jsm. So, they run as SystemPrincipal.

3. SystemPrincipal has a 'default' OriginAttributes (with 0 values for each key - mUserContextId is 0, mPrivateBrowsing is 0, etc...).
When HttpChannelChild does:  SendPHttpChannelConstructor(..., IPC::SerializedLoadContext(this)...) we serialize this 'default' OriginAttributes.

4. The parent process receives the HttpChannelParentConstructor call. It deserialize the LoadContext and it finds an OriginAttributes with 0 mUserContextId, 0 mPrivateBrowsing, - the 'default' OriginAttributes. Then, it compares this OA with any existing tabContext. Often we have a 'normal' tab opened, so everything is fine -luckily-. But if only container tabs are opened and/or only about:newtabs (because of the userContextId 5 used for thumbnails), we crash.

This seems a reasonable explanation of what happens.

Now, the fix for this issue must be done correctly, and first, I need a feedback from Valentin, or some other necko guy because, it seems to me that we don't support httpChannel actors if they are created by JSM (or anything else using systemPrincipal) in the child process: the comparison is based on the OA of the tab only.
Flags: needinfo?(valentin.gosu)
An easy fix would be to add the full principal instead just OriginAttributes in the serialization of loadContext.
Then the OA comparison is done only if the principal is not the systemPrincipal. But this makes our security model weaker :/
Long term plan is to modify pdf.js in order to use the content Principal everywhere.
(In reply to Andrea Marchesini [:baku] from comment #64)
> An easy fix would be to add the full principal instead just OriginAttributes
> in the serialization of loadContext.
> Then the OA comparison is done only if the principal is not the
> systemPrincipal. But this makes our security model weaker :/
> Long term plan is to modify pdf.js in order to use the content Principal
> everywhere.

I think we should do this at least for Firefox 50.  Then for 51+ we can try the full fix and update the pdf code.
Attached patch crash_necko.patch (obsolete) — Splinter Review
Attachment #8805755 - Flags: review?(valentin.gosu)
Comment on attachment 8805755 [details] [diff] [review]
crash_necko.patch

Review of attachment 8805755 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for digging into this.

::: netwerk/ipc/NeckoParent.cpp
@@ +131,5 @@
> +  // If serialized is null, we cannot validate anything. We have to assume that
> +  // this requests comes from a SystemPrincipal.
> +  if (!aSerialized.IsNotNull()) {
> +    return nullptr;
> +  }

Shouldn't you also update aAttrs with the Docshell origin attributes? See the end of this method.
I don't know if we need to worry about  lowering security here UsingNeckoIPCSecurity() returns false anyway. :)
Attachment #8805755 - Flags: review?(valentin.gosu) → review-
Attachment #8805755 - Attachment is obsolete: true
Attachment #8805792 - Flags: review?(valentin.gosu)
Comment on attachment 8805792 [details] [diff] [review]
Part 1 - IsNotNull() check

Review of attachment 8805792 [details] [diff] [review]:
-----------------------------------------------------------------

Looks great. Thanks!
Please nominate for uplifting after landing.
Attachment #8805792 - Flags: review?(valentin.gosu) → review+
(In reply to Valentin Gosu [:valentin] from comment #69)
> Looks great. Thanks!
> Please nominate for uplifting after landing.

Also I'm not sure this fixes the container bug, as we still fall through when !contextArray.IsEmpty() and aSerialized.IsNotNull()
Flags: needinfo?(valentin.gosu) → needinfo?(amarchesini)
Baku, let me know if you need any help testing this patch.
(In reply to Kamil Jozwiak [:kjozwiak] from comment #71)
> Baku, let me know if you need any help testing this patch.

I think this patch fixes the crash, but, as you know, it's hard to reproduce the issue. If you want to spend some times on monday, or when you want, this would be great.
Flags: needinfo?(amarchesini)
Assignee: valentin.gosu → amarchesini
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4a6cd6dff941
Security check in NeckoParent::GetValidatedAppInfo must pass if the serialized LoadContext is null, r=valentin
Comment on attachment 8805792 [details] [diff] [review]
Part 1 - IsNotNull() check

Approval Request Comment
[Feature/regressing bug #]: IPC necko http actors
[User impact if declined]: The security check is wrongly done
[Describe test coverage new/current, TreeHerder]: super racy. a test cannot be written.
[Risks and why]: The patch is super simple: the check !aSerialized.IsNotNull() was missing in NeckoParent::GetValidatedAppInfo. The patch looks bigger because I moved some if() statements around.
[String/UUID change made/needed]: none
Attachment #8805792 - Flags: approval-mozilla-beta?
Attachment #8805792 - Flags: approval-mozilla-aurora?
> Also I'm not sure this fixes the container bug, as we still fall through
> when !contextArray.IsEmpty() and aSerialized.IsNotNull()

In that case, I guess we have 2 separate bugs. What you are describing is not a JSM loading some resources in the child process, bug a bug in how we propagate userContextId.
I didn't remove the debugString. We can see if we trigger this issue again also with this patch.
https://hg.mozilla.org/mozilla-central/rev/4a6cd6dff941
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
I haven't been able to reproduce the crash for now with today's build (I was able to reproduce it consistently with yesterday's build on Slack). I'll report back if I manage to reproduce it again.
I was looking at Nightly52 crash data and the last signature is still visible on 10/31 builds. Perhaps this fix doesn't work for all attached signatures.
Comment on attachment 8805792 [details] [diff] [review]
Part 1 - IsNotNull() check

Let's uplift to Aurora51 to stabilize. We need a bit more data before deciding to uplift this in 50 RC builds.
Attachment #8805792 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
(In reply to Andrea Marchesini [:baku] from comment #76)
> I didn't remove the debugString. We can see if we trigger this issue again
> also with this patch.
(In reply to Ritu Kothari (:ritu) from comment #79)
> I was looking at Nightly52 crash data and the last signature is still
> visible on 10/31 builds. Perhaps this fix doesn't work for all attached
> signatures.

The null serialized load info was only a small part of this. It was only after I reviewed the patch that I realized it still doesn't fix the main reason for the mismatch: when the child sends userContextId:5 and all the open windows have userContextId:0, or the opposite (0,5).
I'm going to reopen since it sounds like only part of the problem is fixed.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I've hit this again, but in a different scenario: bp-06928b3c-1e2d-40c5-866b-564522161101.

I'm not sure what caused it. I didn't have any container tab open.
Good news: I know why we crash. Debugging this issue with Marco (thanks a lot btw) we found why this issue happens. Here the STR:

1. Open in a container tab https://gauntface.github.io/simple-push-demo/
2. enable notification
3. copy the CURL command
4. open a normal tab and close any other container tab.
5. wait 5-10 seconds
6. execute the command - BOOM.

The reason of the crash is this: the ServiceWorker of the simple-push-demo is executed when a push notification is received. This SW runs with the correct UCI but we don't have any container tab of that kind. So we crash.
Woohoo \o/!! Awesome work Baku/Marco :)

The STR are a little different under macOS 10.12.1

* Open a container tab and load https://gauntface.github.io/simple-push-demo/
* enable notifications via the sliding menu
** make sure there's an entry saved under notifications via about:preferences#content
* copy the CURL command from the container tab where notifications have been enabled
* open a new regular tab and load the same website (https://gauntface.github.io/simple-push-demo/)
** notifications shouldn't be enabled in this tab
* close the original container tab where you've enabled notifications so only the default tab is opened
* launch a terminal and run the CURL command that you copied earlier

You'll get the following crash instantly: 

Process 65407 stopped
* thread #1: tid = 0xad6881, 0x000000010260356c XUL`mozilla::net::CrashWithReason(reason="GetValidatedAppInfo | App does not have permission -(1,0)") + 25 at NeckoParent.cpp:105, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010260356c XUL`mozilla::net::CrashWithReason(reason="GetValidatedAppInfo | App does not have permission -(1,0)") + 25 at NeckoParent.cpp:105
   102 	{
   103 	#ifndef RELEASE_OR_BETA
   104 	  MOZ_CRASH_ANNOTATE(reason);
-> 105 	  MOZ_REALLY_CRASH();
   106 	#endif
   107 	}
   108
Attached patch part 2 - SW+Push (obsolete) — Splinter Review
Attachment #8806378 - Flags: review?(valentin.gosu)
Attachment #8806378 - Flags: review?(bugmail)
Comment on attachment 8806378 [details] [diff] [review]
part 2 - SW+Push

Review of attachment 8806378 [details] [diff] [review]:
-----------------------------------------------------------------

With the context of the bug building on our IRC conversation, I think I have an improvement.  Please see the NeckoParent.cpp comment.

I'm shifting r? to :bkelly since although you (:baku) can defer reviews to me for this, I think any ServiceWorkerManager change still wants his awareness/sign-off, but I'm happy to do any follow-up review steps if :bkelly is cool with it in order to help reduce his review burden and/or because this is the area I'm messing with for remoting ServiceWorkerPrivate.

:bkelly, the quick context dump is that NeckoParent does a security check in GetValidatedAppInfo to make sure the request from the content process is legit and it only knows about TabContexts as tracked by the ContentProcessManager, which of course is entirely unaware of what ServiceWorkers get up to.  In this case the push service has decided to spin up a ServiceWorker in response to a push notification in the content process (which it does itself from within the content process).  Glossing a little bit, because the TabContext origin attribute checks are very granular, as long as there is a TabContext with the same (container) mUserContextId, things are cool so usually everything works out.  But when someone is using the non-default container and a service worker gets involved, the potential for crashes goes way up since you're less likely to luck into a TabContext with the same mUserContextId.

::: netwerk/ipc/NeckoParent.cpp
@@ +224,5 @@
>  
>      return nullptr;
>    }
>  
> +  // Maybe this can be a ServiceWorker

It looks like this check occurs too late.  Specifically, I think the ServiceWorker checks wants to occur prior to the `if (contextArray.IsEmpty()) {` check which can induce a crash.

This matters if a ContentParent can exist without having any TabContexts.  Since in this very case the existence of the ServiceWorker is decoupled from the existence of a TabContext, it seems appropriate.

Having said that, I think right now the ContentParent/ContentProcessManager may kill the process once the number of TabContexts goes to zero, *but*:
- I think :gabor may be changing that for efficiency in the nProcesses > 1 case (if he or :mrkbap haven't already).
- My changes for ServiceWorkerPrivate remoting currently call for a dedicated content process to be used which is not tracked by the ContentProcessManager.  This means the GetManagedTabContext() call will assert in ContentProcessManager::GetTabContextByContentProcess() and it'll need to be changed.

So in addition to moving the check up, I suggest we change
  HasRegistrations(nsIPrincipal* aPrincipal);
to instead be:
  MayHaveActiveServiceWorkerInstance(ContentParent* aContent, nsIPrincipal* aPrincipal);

The comment on MayHaveActiveServiceWorkerInstance would be something like:

"""
Return true if the given content process could potentially be executing service worker code with the given principal.  At the current time, this just means that we have any registration for the origin, regardless of scope.  This is a very weak guarantee but is the best we can do when push notifications can currently spin up a service worker in content processes without our involvement in the parent process.

In the future when there is only a single ServiceWorkerManager in the parent process that is entirely in control of spawning and running service worker code, we will be able to authoritatively indicate whether there is an activate service worker in the given content process.  At that time we will rename this method HasActiveServiceWorkerInstance and provide semantics that ensure this method returns true until the worker is known to have shut down in order to allow the caller to induce a crash for security reasons without having to worry about shutdown races with the worker.
"""
Attachment #8806378 - Flags: review?(bugmail) → review?(bkelly)
Attached patch part 3 - test (obsolete) — Splinter Review
Attachment #8806453 - Flags: review?(valentin.gosu)
Attached patch part 2 - SW+Push (obsolete) — Splinter Review
Andrew, thanks for you comments!
Attachment #8806378 - Attachment is obsolete: true
Attachment #8806378 - Flags: review?(valentin.gosu)
Attachment #8806378 - Flags: review?(bkelly)
Attachment #8806455 - Flags: review?(bkelly)
Using the STR from comment#85, you can reproduce this under Win 10 Pro x64 as well.
Comment on attachment 8806453 [details] [diff] [review]
part 3 - test

Review of attachment 8806453 [details] [diff] [review]:
-----------------------------------------------------------------

I <3 tests!
Attachment #8806453 - Flags: review?(valentin.gosu) → review+
Comment on attachment 8806455 [details] [diff] [review]
part 2 - SW+Push

Review of attachment 8806455 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/ipc/NeckoParent.cpp
@@ +98,5 @@
>    return kPBOverride_Unset;
>  }
>  
> +static already_AddRefed<nsIPrincipal>
> +GetRequestingPrincipalFromOptionalLoadInfoArgs(const OptionalLoadInfoArgs aOptionalLoadInfoArgs)

Maybe pass the arg as a reference?
And how about a shorter name? GetRequestingPrincipal should be clear enough.

@@ +214,5 @@
>  
>      return nullptr;
>    }
>  
> +  // Maybe this can be a ServiceWorker

It would be nice if this explained why it is necessary to check for active service workers.
Attached patch part 2 - SW+PushSplinter Review
Here the comment and the shorter name. I cannot pass a reference of nsIPrincipal because PrincipalInfoToPrincipal allocates a new nsIPrincipal and I want to keep this code safe.
Attachment #8805792 - Attachment is obsolete: true
Attachment #8806455 - Attachment is obsolete: true
Attachment #8805792 - Flags: approval-mozilla-beta?
Attachment #8806455 - Flags: review?(bkelly)
Attachment #8806604 - Flags: review?(valentin.gosu)
Attachment #8806604 - Flags: review?(bkelly)
Comment on attachment 8805792 [details] [diff] [review]
Part 1 - IsNotNull() check

We still need this patch in beta.
Attachment #8805792 - Attachment description: crash_necko.patch → Part 1 - IsNotNull() check
Attachment #8805792 - Attachment is obsolete: false
Attachment #8805792 - Flags: approval-mozilla-beta?
Comment on attachment 8806604 [details] [diff] [review]
part 2 - SW+Push

Review of attachment 8806604 [details] [diff] [review]:
-----------------------------------------------------------------

r=me on the Necko bits. Thanks!
Attachment #8806604 - Flags: review?(valentin.gosu) → review+
Comment on attachment 8806604 [details] [diff] [review]
part 2 - SW+Push

Review of attachment 8806604 [details] [diff] [review]:
-----------------------------------------------------------------

I spoke with Andrea about this on IRC.  I think this is a bandaid because we don't really know if the content process is running the service worker.  Its a good step for now, though, and is worth doing to fix the crash in beta.

Please add a comment and follow-up bug that this check should be removed or improved once we move to our new e10s service worker model.
Attachment #8806604 - Flags: review?(bkelly) → review+
Comment on attachment 8805792 [details] [diff] [review]
Part 1 - IsNotNull() check

I don't think the entire fix is ready yet, clearing the beta nom.
Attachment #8805792 - Flags: approval-mozilla-beta?
(In reply to Ben Kelly [:bkelly] from comment #96)
> Comment on attachment 8806604 [details] [diff] [review]
> part 2 - SW+Push
> 
> Review of attachment 8806604 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I spoke with Andrea about this on IRC.  I think this is a bandaid because we
> don't really know if the content process is running the service worker.  Its
> a good step for now, though, and is worth doing to fix the crash in beta.
> 
> Please add a comment and follow-up bug that this check should be removed or
> improved once we move to our new e10s service worker model.

Hi Ben,

Are you saying the bandaid may not fix the crash?  Or the bandaid will fix the crash, but there is more followup work to do here.
(In reply to Ritu Kothari (:ritu) from comment #97)
> Comment on attachment 8805792 [details] [diff] [review]
> Part 1 - IsNotNull() check
> 
> I don't think the entire fix is ready yet, clearing the beta nom.

Once baku lands this in inbound, we can re-nominate all the necessary patches for beta.  I'm hoping that will be tomorrow.
(In reply to Tanvi Vyas [:tanvi] from comment #98)
> Are you saying the bandaid may not fix the crash?  Or the bandaid will fix
> the crash, but there is more followup work to do here.

I'm saying its not actually providing any anti-spoofing protection like our TabContext.  We are basically just allowing it through anyway if we have a SW registration matching the principal.  We have no positive confirmation that SW is running in the child process creating the channel.

Long term we will have a data structure in the parent process that knows where service workers should be running.  At that point we can do a better job of preventing someone from spoofing a child process service worker.
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0473f2d0d024
NeckoParent::GetValidatedAppInfo should consider ServiceWorkers when validating HttpChannel requests, r=bkelly, r=valentin, f=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/40e351ced11b
Test about Containers, SW and PushNotifications, r=valentin
Backed out for eslint failure (missing globals):

https://hg.mozilla.org/integration/mozilla-inbound/rev/439965937366d0c2a504f8ca8cab1ed07be24fe7
https://hg.mozilla.org/integration/mozilla-inbound/rev/0bf3ab8a2a94a1d1b8299e1e9395f0361b1b7dba

Push with failure: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=40e351ced11b25aa0bdb8bd572e6e68ba1609c18
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=38582322&repo=mozilla-inbound 

[task 2016-11-02T21:03:31.378029Z] TEST-UNEXPECTED-ERROR | /home/worker/checkouts/gecko/browser/components/contextualidentity/test/browser/browser_pushnotification.js:44:26 | 'ChromeUtils' is not defined. (no-undef)
[task 2016-11-02T21:03:31.378102Z] TEST-UNEXPECTED-ERROR | /home/worker/checkouts/gecko/browser/components/contextualidentity/test/browser/pushworker.js:1:1 | 'onpush' is not defined. (no-undef)
Flags: needinfo?(amarchesini)
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a4ee836c49a3
NeckoParent::GetValidatedAppInfo should consider ServiceWorkers when validating HttpChannel requests, r=bkelly, r=valentin, f=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/8a072e37f540
Test about Containers, SW and PushNotifications, r=valentin
Comment on attachment 8805792 [details] [diff] [review]
Part 1 - IsNotNull() check

Ritu, we really need this for beta. As we need the following patch.
Together the fix the issue completely.

The reason why need them is because otherwise we could crash in this scenario: on mac, FF is running but without tabs/windows. A push notification is received. We crash because this necko check fails.
Flags: needinfo?(amarchesini)
Attachment #8805792 - Flags: approval-mozilla-beta?
Comment on attachment 8806604 [details] [diff] [review]
part 2 - SW+Push

See previous comments
Attachment #8806604 - Flags: approval-mozilla-beta?
Attachment #8806604 - Flags: approval-mozilla-aurora?
Note that these 2 patches do not fix a container problem. It's more generic than this because it can happen (I'm not 100% sure yet) in mac when no windows/tabs are opened and a push notification is received.
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5ec89b4b058e
Backed out changeset 8a072e37f540 for failures in browser_pushnotification.js
(In reply to Andrea Marchesini [:baku] from comment #106)
> Note that these 2 patches do not fix a container problem. It's more generic
> than this because it can happen (I'm not 100% sure yet) in mac when no
> windows/tabs are opened and a push notification is received.

Steps to reproduce:
1. Open in a tab https://gauntface.github.io/simple-push-demo/
2. Enable notifications
3. Copy the cURL command
4. Close all tabs (but make sure Firefox is still running)
5. Wait 5-10 seconds
6. Execute the cURL command

Marcia was able to reproduce this crash with Nightly, but not with Beta.
https://crash-stats.mozilla.com/report/index/dee1c77c-792f-4457-b61c-1d4a52161103

Andi tried with Nightly and was able to crash when he had a tab open (without any tab open Firefox didn't crash but didn't receive the notification either).
https://crash-stats.mozilla.com/report/index/b65d67df-fa3d-4973-b567-a6bba2161103
With the patch applied he wasn't able to reproduce the crash (but didn't receive the notification either, he only received it if https://gauntface.github.io/simple-push-demo/ was open).
I went through the same STR from comment#108 using the latest debug beta and couldn't reproduce the issue either.

Build Used:
* fx50.0b12, buildId: 20161103105945, changeset: a0a503f568d8

Results:
* https://pastebin.mozilla.org/8924554
Comment on attachment 8806604 [details] [diff] [review]
part 2 - SW+Push

Let's stabilize on 51 before we decide whether to uplift to Beta50 or not.
Attachment #8806604 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Hi Andrei, could you please have your team repro this issue on Beta50 (see comment 108)? Most folks are unable to repro on Beta50 which make me feel this is not a must uplift to RC2. Thank you!
Flags: needinfo?(andrei.vaida)
https://hg.mozilla.org/mozilla-central/rev/a4ee836c49a3
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
I've tested this on 50b6, 50b11, 52.0a2 (2016-11-04) on OS X 10.11 and no crashes occurred.
50b6 - without any tab open (Fx was still running in the background), Firefox didn't crash and receive the notification. Same result with 1 tab open.
50b11, 52.0a2 (2016-11-04) - without any tab open, Firefox didn't crash but didn't receive the notification either. With 1 tab open, Firefox didn't crash and receive the notification.
Flags: needinfo?(andrei.vaida)
Another reason why it _could_ crash, is with signedPkg. But I don't know enough about this signed packages to know if this is an issue here.
Flags: needinfo?(tanvi)
(In reply to Andrea Marchesini [:baku] from comment #115)
> Another reason why it _could_ crash, is with signedPkg. But I don't know
> enough about this signed packages to know if this is an issue here.

signedPkg was implemented for New Security Model in Firefox OS.
I think this attribute is no longer needed.  Maybe we should remove it at some point of time.
(In reply to Andrea Marchesini [:baku] from comment #115)
> Another reason why it _could_ crash, is with signedPkg. But I don't know
> enough about this signed packages to know if this is an issue here.

I don't know much about this either.  Is signedPkg still used for anything?
Flags: needinfo?(tanvi)
Flags: needinfo?(ptheriault)
Flags: needinfo?(dveditz)
Flags: needinfo?(allstars.chh)
We still don't know that this crash won't occur in Firefox 50 release, particularly with the Test Pilot experiment for Containers.  I'm not sure why the STR aren't working in beta7, but the STR were very hard to track down in the first place.  The technical issue that was patched for 51 and 52 exists in 50 as well, even though we can't seem to reproduce the crash.
Paul, have you tested both with e10s turned off and turned on?
Flags: needinfo?(paul.silaghi)
Wes or tomcat, did this land in aurora? Thanks.
Flags: needinfo?(wkocher)
Flags: needinfo?(cbook)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #120)
> Wes or tomcat, did this land in aurora? Thanks.

It did now.
Flags: needinfo?(wkocher)
Flags: needinfo?(cbook)
This late in 50 cycle, I am unable to justify uplifting a patch for a TestPilot project. We should let this fix stabilize on Beta51 for a few weeks and ship with 50.1.0 if deemed safe.
Flags: needinfo?(allstars.chh)
(In reply to Marco Castelluccio [:marco] from comment #119)
> Paul, have you tested both with e10s turned off and turned on?
Just with e10s=ON.
Tried again, with e10s=OFF:
50b6, 50b11, 52.0a2 (2016-11-04) - without any tab open (Fx was still running in the background), Firefox didn't crash and receive the notification. Same result with 1 tab open.
Flags: needinfo?(paul.silaghi)
Bug 1301406 must be uplifted too otherwise these 2 patches do not apply to m-a.
Flags: needinfo?(amarchesini)
Depends on: 1301406
(In reply to Tanvi Vyas [:tanvi] from comment #117)
> Is signedPkg still used for anything?

Doesn't look like it. It should go when the appID stuff does.
Flags: needinfo?(dveditz)
(In reply to Tanvi Vyas [:tanvi] from comment #117)
> I don't know much about this either.  Is signedPkg still used for anything?

Valentin already removed signedPkg from Origin Attributes in bug 1315302.
has problems to uplift like applying crash_necko.patch
patching file netwerk/ipc/NeckoParent.cpp
Hunk #1 FAILED at 115
Hunk #2 FAILED at 167
2 out of 2 hunks FAILED -- saving rejects to file netwerk/ipc/NeckoParent.cpp.rej
patch failed, unable to continue (try -v)
patch failed, rejects left in working directory
errors during apply, please fix and qrefresh crash_necko.patch
Flags: needinfo?(amarchesini)
Tomcat, see comment 126.
Flags: needinfo?(amarchesini)
(In reply to Andrea Marchesini [:baku] from comment #131)
> Tomcat, see comment 126.

ok uplifting now bug 1301406 but there seems to be still problems here for this patch:

373003:a4ee836c49a3 "Bug 1289001 - NeckoParent::GetValidatedAppInfo should consider ServiceWorkers when validating HttpChannel requests, r=bkelly, r=valentin, f=asuth"
merging dom/workers/ServiceWorkerManager.cpp
merging netwerk/ipc/NeckoParent.cpp
merging netwerk/ipc/NeckoParent.h
merging netwerk/protocol/wyciwyg/WyciwygChannelParent.cpp
warning: conflicts while merging netwerk/ipc/NeckoParent.cpp! (edit, then use 'hg resolve --mark')
abort: unresolved conflicts, can't continue
(use 'hg resolve' and 'hg graft --continue')
Attached patch patch 2 for 51 (obsolete) — Splinter Review
Flags: needinfo?(cbook)
Ok, I suspect what it's happening here. But I need a feedback from Bill. Between PBrowserChild::Send__delete__(mTabChild); and TabChild::ActorDestroy we run the event loop. So if an httpChannel is asking for IPCOpen(), we return true. But, in the parent process, we delete the tabContext when Recv__delete__ is received.

I guess the solution is to move mIPCOpen immediately before calling PBrowserChild::Send__delete__(mTabChild).
Flags: needinfo?(wmccloskey)
Attached patch part 3 - tabChild (obsolete) — Splinter Review
Flags: needinfo?(wmccloskey)
Attachment #8809723 - Flags: review?(wmccloskey)
Flags: needinfo?(cbook)
Attachment #8809425 - Attachment is patch: true
Untagging as blocking release, leaving it tracked for 50 in case we have a safe, well-tested fix ready for uplift before 50.1.0 release go-live.
Reopening since there's still patches landing here. We should be cognizant of the next Monday's uplift though and consider using a new bug for further work to avoid making this already-complicated bug more difficult to track.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla52 → ---
Attachment #8798043 - Flags: checkin+
Attachment #8805792 - Flags: checkin+
Attachment #8806453 - Attachment is obsolete: true
Attachment #8806453 - Flags: checkin-
Attachment #8806604 - Flags: checkin+
Attachment #8809425 - Flags: checkin+
Comment on attachment 8809723 [details] [diff] [review]
part 3 - tabChild

Review of attachment 8809723 [details] [diff] [review]:
-----------------------------------------------------------------

(Taking over the review at :baku's request after discussing it on IRC.)

::: dom/ipc/TabChild.h
@@ +748,5 @@
>    {
>      mUnscaledInnerSize = aSize;
>    }
>  
> +  void Send__delete__();

Nit: it would be clearer if this method had a different name from the IPDL Send__delete__, maybe SendDeleteIfOpen or something like that.
Attachment #8809723 - Flags: review?(wmccloskey) → review+
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/419fcafe1794
Fix a race condition in the use of TabChild::mIPCOpen, r=jld
Attachment #8809723 - Attachment description: part 4 - tabChild → part 3 - tabChild
Attachment #8809723 - Flags: approval-mozilla-beta?
Attachment #8809723 - Flags: approval-mozilla-aurora?
(In reply to Andrea Marchesini [:baku] from comment #134)
> Ok, I suspect what it's happening here. But I need a feedback from Bill.
> Between PBrowserChild::Send__delete__(mTabChild); and TabChild::ActorDestroy
> we run the event loop.

Why do you think that? Looking at the code for Send__delete__:
http://searchfox.org/mozilla-central/source/__GENERATED__/ipc/ipdl/PBrowserChild.cpp#2040
It calls DestroySubtree, which calls ActorDestroy:
http://searchfox.org/mozilla-central/source/__GENERATED__/ipc/ipdl/PBrowserChild.cpp#5172
Flags: needinfo?(amarchesini)
Comment on attachment 8809723 [details] [diff] [review]
part 3 - tabChild

You are right. Tomcat, can you revert https://hg.mozilla.org/integration/mozilla-inbound/rev/419fcafe1794 ?
Attachment #8809723 - Attachment is obsolete: true
Flags: needinfo?(amarchesini) → needinfo?(cbook)
Attachment #8809723 - Flags: approval-mozilla-beta?
Attachment #8809723 - Flags: approval-mozilla-aurora?
Flags: needinfo?(wmccloskey)
Flags: needinfo?(amarchesini)
Track 51+ as the #3 of top crash.
Actually I don't see any new crash. All the crashes are coming from not-updated nightly. Marco, can you confirm this?
Flags: needinfo?(wmccloskey) → needinfo?(mcastelluccio)
Interesting: in this crash-report list, there are just crashes with userContextId 5. No containers involved.
(In reply to Andrea Marchesini [:baku] from comment #148)
> Interesting: in this crash-report list, there are just crashes with
> userContextId 5. No containers involved.

What's userContextId 5?
> What's userContextId 5?

about:newtab uses userContextId in a hidden xul:browser, used for thumbnail generation.
Comment on attachment 8805792 [details] [diff] [review]
Part 1 - IsNotNull() check

We might want to take this fix in 50.1.0.
Attachment #8805792 - Flags: approval-mozilla-release?
Attachment #8806604 - Flags: approval-mozilla-release?
This bug is getting messy to track given the number of comments/landings/backouts and today's merges on top of that. Let's call this specific bug fixed and file a new bug to track any further work that needs to happen on trunk for any remaining problem spots. And obviously we still need to sort out the branch uplifts here.
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
See Also: → 1317611
Attachment #8809425 - Attachment description: patch 2 for m-a → patch 2 for 51
Comment on attachment 8805792 [details] [diff] [review]
Part 1 - IsNotNull() check

Since merge day is passed and we may still need the patch in 51 beta, I remove the approval‑mozilla‑aurora+ flag and keep approval‑mozilla‑beta? for 51.
Attachment #8805792 - Flags: approval-mozilla-aurora+
Attachment #8806604 - Flags: approval-mozilla-aurora+
Hi baku,
Since the patch for 51 was backed out because of some assert errors, I'm still waiting for the new patches for 51 beta. Need info you to put this in your radar.
Flags: needinfo?(amarchesini)
Attached patch patch 2 for 51Splinter Review
Attachment #8809425 - Attachment is obsolete: true
Flags: needinfo?(amarchesini)
Attachment #8811718 - Attachment is patch: true
Comment on attachment 8806604 [details] [diff] [review]
part 2 - SW+Push

Might as well move the requests over to the patch that'd actually be going to Beta/Release if approved.
Attachment #8806604 - Flags: approval-mozilla-release?
Attachment #8806604 - Flags: approval-mozilla-beta?
Attachment #8811718 - Flags: approval-mozilla-release?
Attachment #8811718 - Flags: approval-mozilla-beta?
Hi Ryan,
Do you know Which one is for beta uplift request? both or patch 2 for 51 only?
Flags: needinfo?(ryanvm)
Both
Flags: needinfo?(ryanvm)
Comment on attachment 8805792 [details] [diff] [review]
Part 1 - IsNotNull() check

Fix a crash. Beta51+. Should be in 51 beta 2.
Attachment #8805792 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8811718 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a206b81a650d
Remove the MOZ_NORETURN hint from CrashWithReason. r=valentin
Flags: needinfo?(ptheriault)
Comment on attachment 8805792 [details] [diff] [review]
Part 1 - IsNotNull() check

Top crasher on 50, meets the triage bar for inclusion in 50.1.0
Attachment #8805792 - Flags: approval-mozilla-release? → approval-mozilla-release+
Attachment #8811718 - Flags: approval-mozilla-release? → approval-mozilla-release+
Is it too late for submitting a new patch for release?
Flags: needinfo?(amarchesini) → needinfo?(ryanvm)
Yes, 51 goes to RC next week.
Flags: needinfo?(ryanvm)
You need to log in before you can comment on or make changes to this bug.