Closed
Bug 1289001
Opened 8 years ago
Closed 8 years ago
Crash in IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b
Categories
(Core :: Networking: HTTP, defect, P1)
Tracking
()
RESOLVED
FIXED
mozilla52
People
(Reporter: n.nethercote, Assigned: baku)
References
(Blocks 1 open bug)
Details
(Keywords: crash, regression, Whiteboard: [necko-active][OA])
Crash Data
Attachments
(5 files, 6 obsolete files)
4.61 KB,
patch
|
jduell.mcbugs
:
review+
RyanVM
:
checkin+
|
Details | Diff | Splinter Review |
6.48 KB,
text/plain
|
Details | |
4.06 KB,
patch
|
valentin
:
review+
gchang
:
approval-mozilla-beta+
ritu
:
approval-mozilla-release+
RyanVM
:
checkin+
|
Details | Diff | Splinter Review |
17.44 KB,
patch
|
bkelly
:
review+
valentin
:
review+
RyanVM
:
checkin+
|
Details | Diff | Splinter Review |
17.61 KB,
patch
|
gchang
:
approval-mozilla-beta+
ritu
:
approval-mozilla-release+
|
Details | Diff | Splinter Review |
This bug was filed from the Socorro interface and is
report bp-1f2056b1-5998-465e-bdc3-46ed22160725.
=============================================================
New crash, first seen in Nightly 20160722030235 and it is the #9 topcrash in that build. It has occurred 36 times across multiple installations, on both Windows and Mac.
In all the crash reports thread 0 is reported as the crashing thread, but the stack trace varies widely across the different crash report and so appears to be unrelated. The processor notes field says 'Signature replaced with an IPC Channel Error, was: "<func>"'.
The full abort message is "Value error: message was deserialized, but contained an illegal value".
I'm not sure how to interpret this one. mrbkap, any ideas from the IPC side? Valentin, any ideas from the Necko side?
Flags: needinfo?(valentin.gosu)
Flags: needinfo?(mrbkap)
Comment 1•8 years ago
|
||
I'm not entirely sure, but I believe this means that one of the parameters (from the child to the parent) failed to unwrap during the IPC call. I glanced through checkins to netwerk/ from July 21-today and didn't see anything obvious, but I didn't look too closely so I could have missed something. This could have also meant that AllocPHttpChannelParent failed, but I'm not sure why that would be (one thing that can cause it to fail is if we're trying to create an HTTP channel from a content process that doesn't have a tab parent, is anybody trying that yet?).
Flags: needinfo?(mrbkap)
Comment 2•8 years ago
|
||
Updated•8 years ago
|
OS: Windows 10 → All
Reporter | ||
Comment 3•8 years ago
|
||
(In reply to Ludovic Hirlimann [:Usul] from comment #2)
> url to reproduce :
> http://c.ymcdn.com/sites/hmgs.site-ym.com/resource/resmgr/Historicon/
> HC16_PEL_05_13.pdf
I tried and failed to reproduce on three different machines (Windows, Mac, Linux) all running Nightly.
Were there any specific steps to reproduce with this URL? I just scrolled slowly through the PDF to the bottom.
Flags: needinfo?(ludovic)
Comment 4•8 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #3)
> (In reply to Ludovic Hirlimann [:Usul] from comment #2)
> > url to reproduce :
> > http://c.ymcdn.com/sites/hmgs.site-ym.com/resource/resmgr/Historicon/
> > HC16_PEL_05_13.pdf
>
> I tried and failed to reproduce on three different machines (Windows, Mac,
> Linux) all running Nightly.
>
> Were there any specific steps to reproduce with this URL? I just scrolled
> slowly through the PDF to the bottom.
current nightly doesn't repro. :(
Flags: needinfo?(ludovic)
Updated•8 years ago
|
Whiteboard: [necko-next]
Hi Ludovic, can you try to repro using Aurora50? This is a top content crasher on the Aurora channel/
Flags: needinfo?(ludovic)
Comment 6•8 years ago
|
||
(In reply to Ritu Kothari (:ritu) from comment #5)
> Hi Ludovic, can you try to repro using Aurora50? This is a top content
> crasher on the Aurora channel/
I was just loading the url from comment 2 when I was crashing for sure. this doesn't work anymore :(
Flags: needinfo?(ludovic)
Comment 7•8 years ago
|
||
Crash volume for signature 'IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b':
- nightly (version 51): 704 crashes from 2016-08-01.
- aurora (version 50): 2103 crashes from 2016-08-01.
- beta (version 49): 0 crashes from 2016-08-02.
- release (version 48): 0 crashes from 2016-07-25.
- esr (version 45): 0 crashes from 2016-05-02.
Crash volume on the last weeks (Week N is from 08-22 to 08-28):
W. N-1 W. N-2 W. N-3
- nightly 232 178 136
- aurora 865 609 174
- beta 0 0 0
- release 0 0 0
- esr 0 0 0
Affected platforms: Windows, Mac OS X, Linux
Crash rank on the last 7 days:
Browser Content Plugin
- nightly #5
- aurora #3
- beta
- release
- esr
status-firefox51:
--- → affected
Comment 8•8 years ago
|
||
Currently reproducing with http://www.the-ninth-age.com/filebase/index.php?download/18/
Hi Bill, is this content crash similar or same as the browser crash you fixed in bug 1277582?
Flags: needinfo?(wmccloskey)
Comment 10•8 years ago
|
||
STR:
1. Open latest aurora.
2. Set the pref "privacy.userContext.enabled" and "privacy.userContext.ui.enabled" set on true.
3. Restart the browser .
4. Press alt and from File->New Container Tab->Work.
5. Open the console and only activate .Net>XHR.
6. Go to www.cnn.com and click a news.
Notes:
a. The machines had Intel Graphics.
b. e10s was activated.
c. This occured on 3 differite machines, 2 on Windows 10 and one Windows 7.
d. This happens with all the container tabs.
Crash reports:
https://crash-stats.mozilla.com/report/index/76b9b756-c5b8-490c-8cbb-3e2132160915
https://crash-stats.mozilla.com/report/index/432d70a5-c0f3-4d86-99e0-677052160915
https://crash-stats.mozilla.com/report/index/98ff4e8e-c80f-4fad-91e3-5f7312160915
Comment 11•8 years ago
|
||
I've just hit this crash on Linux, when I closed a Google docs page which was in a separate container: https://crash-stats.mozilla.com/report/index/11a476a5-a067-49bc-b119-485682160919.
Comment 12•8 years ago
|
||
From what I can tell deserialization of SerializedLoadContext fails because return
> aResult->mOriginAttributes.PopulateFromSuffix(suffix);
returns false.
This bug was filed right after bug 1295103 landed.
I'm not exactly sure why the call fails. Kan-Ru, do you have time to take a look at this?
Blocks: ContextualIdentity
Flags: needinfo?(valentin.gosu) → needinfo?(kchen)
Were you able to reproduce this Valentin? I wasn't. Based on what I can figure out, it seems like AllocPHttpChannelParent might be returning null. That's the only place I can find where we return MsgValueError without first calling FatalError (which will crash the parent).
However, your comment suggests that one of the Read functions on the arguments is failing. If so, I wonder why we're not crashing first in FatalError?
Flags: needinfo?(wmccloskey) → needinfo?(valentin.gosu)
Comment 14•8 years ago
|
||
(In reply to Bill McCloskey (:billm) from comment #13)
> Were you able to reproduce this Valentin? I wasn't.
No, I haven't. I was basing my analysis on the crash reports.
> Based on what I can
> figure out, it seems like AllocPHttpChannelParent might be returning null.
> That's the only place I can find where we return MsgValueError without first
> calling FatalError (which will crash the parent).
>
> However, your comment suggests that one of the Read functions on the
> arguments is failing. If so, I wonder why we're not crashing first in
> FatalError?
You're right, if it were the deserialization it should have called FatalError.
It is however possible for AllocPHttpChannelParent to return nullptr if NeckoParent::CreateChannelLoadContext returns an error string because in NeckoParent::GetValidatedAppInfo aSerialized is "null".
That means the problem would be on the child side - so in SerializedLoadContext(nsIChannel*), NS_QueryNotificationCallbacks turns up a null load context. I'm not sure why that is, or what we should do in that case.
Jason, any idea what the behaviour is supposed to be in this case?
Flags: needinfo?(valentin.gosu) → needinfo?(jduell.mcbugs)
Comment 15•8 years ago
|
||
So for B2G apps we were quite strict--if a channel from the child didn't have a LoadContext (SerializedLoadContext, to be precise), we killed it the NeckoParent security checks. That logic is still there.
So this sounds like a programming error in the child somewhere. Something is creating a channel without giving it a LoadInfo in its callbacks.
Flags: needinfo?(jduell.mcbugs)
Updated•8 years ago
|
Flags: needinfo?(kchen)
Comment 16•8 years ago
|
||
[Tracking Requested - why for this release]:
this signature is regressing since firefox 50 builds and is currently the #1 crasher in the content process with 12% of all crashes in 50.0b1 crash data.
tracking-firefox50:
--- → ?
Keywords: regression
Comment 18•8 years ago
|
||
(In reply to Jason Duell [:jduell] (needinfo me) from comment #15)
> So for B2G apps we were quite strict--if a channel from the child didn't
> have a LoadContext (SerializedLoadContext, to be precise), we killed it the
> NeckoParent security checks. That logic is still there.
>
> So this sounds like a programming error in the child somewhere. Something
> is creating a channel without giving it a LoadInfo in its callbacks.
Ehsan, is this still the case? It seems to me that in the SerializedLoadContext constructor, we still handle the case when !loadContext. Should we change the HttpChannelParent constructor so that it allows a null loadContext?
Thanks!
Comment 20•8 years ago
|
||
(In reply to Valentin Gosu [:valentin] from comment #18)
> (In reply to Jason Duell [:jduell] (needinfo me) from comment #15)
> > So for B2G apps we were quite strict--if a channel from the child didn't
> > have a LoadContext (SerializedLoadContext, to be precise), we killed it the
> > NeckoParent security checks. That logic is still there.
> >
> > So this sounds like a programming error in the child somewhere. Something
> > is creating a channel without giving it a LoadInfo in its callbacks.
>
> Ehsan, is this still the case?
If you mean, do we still have the code to kill the child when invalid stuff is found in Necko IPC, then the answer is yes. This is controlled by the network.disable.ipc.security pref, which is true on desktop (which means, the said security checks are bypassed.)
I think this is completely irrelevant to this bug though, FWIW.
> It seems to me that in the
> SerializedLoadContext constructor, we still handle the case when
> !loadContext. Should we change the HttpChannelParent constructor so that it
> allows a null loadContext?
> Thanks!
I don't understand what you mean here. A SerializedLoadContext can be created from an nsILoadContext* or from other things, including nullptr, where we won't have a load context pointer, but that's fine. That's why the branch handling a null load context exists.
Now, let me give a shot at debugging this. :-)
Comment 10 says that this happens with the containers turned on. Containers are implemented by the userContextId field on OriginAttributes. In comment 12, you said that this is happening because PopulateFromSuffix() is returning false, which makes sense if the bug has something to do with containers. Note that there are a couple of cases where parsing "userContextId" would fail, for example if its value isn't an integer or is above the uint32_t range. Now, there is some OA manipulation happening inside NeckoParent::GetValidatedAppInfo(), but the thing that looks suspect is this code: <http://searchfox.org/mozilla-central/source/netwerk/ipc/NeckoParent.cpp#110,143> Note how we're trying to match the userContextIds, and if we get a mismatch, we'll end up here <http://searchfox.org/mozilla-central/source/netwerk/ipc/NeckoParent.cpp#157> returning an error which would crash with the symptoms you're seeing here. I bet this is the bug.
Now the $1,000,000 question is: how do we end up with a mismatching user context ID. Obviously if we managed to reproduce this, we can easily debug what's happening. But assuming that we can't (I'm assuming you've been testing with loading stuff in tabs from different containers, which is how you'd trigger the code in question...), we need to come up with theories on what could cause this. For example, could we end up in a situation where the TabContext's notion of its OriginAttributes is out of date with the child process?
I'm CCing baku and tanvi who are familiar with the container feature.
Flags: needinfo?(ehsan)
Tracking this top crasher as a blocking issue for Fx50.
Hi Baku, Tanvi, this is a top crasher on Beta50. Ehsan has some questions in comment 20. I hope you can help root cause and we can have a potential fix ready soon. Thanks!
Flags: needinfo?(tanvi)
Flags: needinfo?(amarchesini)
Comment 23•8 years ago
|
||
(In reply to Ritu Kothari (:ritu) from comment #22)
> Hi Baku, Tanvi, this is a top crasher on Beta50. Ehsan has some questions in
> comment 20. I hope you can help root cause and we can have a potential fix
> ready soon. Thanks!
My questions were actually directed at Valentin. Needinfoing him instead.
Flags: needinfo?(valentin.gosu)
Flags: needinfo?(tanvi)
Flags: needinfo?(amarchesini)
Comment 24•8 years ago
|
||
mozilla::ipc::ProcessLink::SendMessageW on 49 is probably due to the same bug.
I see the following correlations:
(100.0% in signature vs 00.09% overall) moz_crash_reason = MOZ_CRASH(IPC message size is too large)
(100.0% in signature vs 39.88% overall) reason = EXCEPTION_BREAKPOINT
(42.94% in signature vs 00.03% overall) ipc_message_name = PContent::Reply_ReadPrefsArray
(40.59% in signature vs 00.03% overall) ipc_message_name = PNecko::Msg_PHttpChannelConstructor
(42.94% in signature vs 09.30% overall) useragent_locale = ru
(30.59% in signature vs 02.55% overall) Addon "McAfee SiteAdvisor" = true
(38.24% in signature vs 00.03% overall) ipc_message_name = PNecko::Msg_PHttpChannelConstructor ∧ useragent_locale = ru
The 40% of the reports that have (ipc_message_name = PNecko::Msg_PHttpChannelConstructor ∧ useragent_locale = ru) are really similar to this bug.
I will file a new bug for the other ipc_message_name.
(In 50 and up, that signature is related to bug 1288997)
Crash Signature: [@ IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b] → [@ IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b]
[@ mozilla::ipc::ProcessLink::SendMessageW ]
Comment 25•8 years ago
|
||
So, looks like 49 is affected, but with a smaller volume and a different signature (unless it is a different crash).
The McAfee SiteAdvisor addon is correlated with the crash with the other ipc_message_name ((30.59% in signature vs 00.02% overall) Addon "McAfee SiteAdvisor" = true ∧ ipc_message_name = PContent::Reply_ReadPrefsArray).
Comment 26•8 years ago
|
||
Actually, the crash caused by a too large IPC message with PNecko::Msg_PHttpChannelConstructor is tracked in bug 1277681.
Crash Signature: [@ IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b]
[@ mozilla::ipc::ProcessLink::SendMessageW ] → [@ IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b]
Comment 27•8 years ago
|
||
Crash volume for signature 'IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b':
- nightly (version 52): 364 crashes from 2016-09-19.
- aurora (version 51): 660 crashes from 2016-09-19.
- beta (version 50): 4279 crashes from 2016-09-20.
- release (version 49): 0 crashes from 2016-09-05.
- esr (version 45): 0 crashes from 2016-06-01.
Crash volume on the last weeks (Week N is from 10-03 to 10-09):
W. N-1 W. N-2
- nightly 223 141
- aurora 563 97
- beta 3398 881
- release 0 0
- esr 0 0
Affected platforms: Windows, Mac OS X, Linux
Crash rank on the last 7 days:
Browser Content Plugin
- nightly #7
- aurora #4
- beta #1
- release
- esr
status-firefox52:
--- → affected
Updated•8 years ago
|
Assignee: nobody → valentin.gosu
Flags: needinfo?(valentin.gosu)
Whiteboard: [necko-next] → [necko-active]
Comment 28•8 years ago
|
||
Comment 29•8 years ago
|
||
Since we weren't able to reproduce this issue with the STR, I want push a diagnostic patch first, to figure out what's going on.
Attachment #8798043 -
Flags: review?(jduell.mcbugs)
Updated•8 years ago
|
Attachment #8798043 -
Flags: review?(jduell.mcbugs) → review+
Comment 30•8 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/871153c1b8b26da037fba183a558cbec584dfc09
Bug 1289001 - Add code to GetValidatedAppInfo to figure out why it fails r=jduell
Comment 31•8 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Updated•8 years ago
|
Updated•8 years ago
|
Target Milestone: mozilla52 → ---
Comment 32•8 years ago
|
||
Comment on attachment 8798043 [details] [diff] [review]
bug1289001-diagnose.patch
Review of attachment 8798043 [details] [diff] [review]:
-----------------------------------------------------------------
::: netwerk/ipc/NeckoParent.cpp
@@ +113,5 @@
> +#ifndef RELEASE_BUILD
> + MOZ_CRASH_ANNOTATE(reason);
> + MOZ_REALLY_CRASH();
> +#endif
> +}
FWIW, we have MOZ_DIAGNOSTIC_ASSERT for this exact purpose. You didn't need to roll your own here. :-)
Updated•8 years ago
|
Crash Signature: [@ IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b] → [@ IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b]
[@ mozilla::net::CrashWithReason ]
Comment 34•8 years ago
|
||
So the debug code has some interesting results.
The following pairs represent us trying to match the first field aSerialized.mOriginAttributes.mUserContextId to the second field tabContext.OriginAttributesRef().mUserContextId:
https://crash-stats.mozilla.com/report/index/533e3552-3d5b-4c60-a8fb-905c62161007
-(5,0)(5,0)
https://crash-stats.mozilla.com/report/index/509cac1f-8555-47d6-873d-ec4d22161006
-(0,2)(0,2)(0,2)(0,2)(0,2)
It seems that we're both trying to trying to create a channel with a non-zero userContextId in a tab with a zero ContextId, and vice-versa.
Updated•8 years ago
|
Crash Signature: [@ IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b]
[@ mozilla::net::CrashWithReason ] → [@ IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b]
[@ mozilla::net::CrashWithReason ]
[@ mozilla::net::NeckoParent::GetValidatedAppInfo ]
Comment 35•8 years ago
|
||
FYI - I got crash in [@ mozilla::net::CrashWithReason ] on 52.0a1 [e10s enabled] after closing page which was still loading images
Crashlog report - https://crash-stats.mozilla.com/report/index/0580288c-4f4a-4368-ab37-7801b2161008
Comment 36•8 years ago
|
||
(In reply to Virtual_ManPL [:Virtual] - (ni? me) from comment #35)
> FYI - I got crash in [@ mozilla::net::CrashWithReason ] on 52.0a1 [e10s
> enabled] after closing page which was still loading images
> Crashlog report -
> https://crash-stats.mozilla.com/report/index/0580288c-4f4a-4368-ab37-7801b2161008
Thanks for the report. It seems we don't only crash for container tabs.
GetValidatedAppInfo | ContentParent does not have any PBrowsers
Comment 37•8 years ago
|
||
I'm just trying to wrap my head around this. It seems most of the crashes occur for (0,5) or (5,0) pairs. What do we use contextId 5 for? Tanvi, do you know if this is for an addon or something in our code?
There are a few for contextIDs 1,2,3,4 as well, but fewer. I assume might be some kind of race between opening channels and closing windows with a certain TabContext.
Flags: needinfo?(wmccloskey)
Flags: needinfo?(tanvi)
Comment 38•8 years ago
|
||
For background page thumbnail code:
http://searchfox.org/mozilla-central/source/toolkit/components/contextualidentity/ContextualIdentityService.jsm#74
http://searchfox.org/mozilla-central/search?q=userContextIdInternal.thumbnail&path=
Flags: needinfo?(wmccloskey)
Flags: needinfo?(tanvi)
Comment 39•8 years ago
|
||
Which would suggest this is a regression from bug 1279568. At least now we have something to back out!
Blocks: 1279568
Assignee | ||
Comment 40•8 years ago
|
||
I know what userContextId == 5 is: this is a private container used to generate thumbnails for about:newtab.
At the beginning the idea was to use privateBrowsing to load this thumbnail for security reasons, but it's not currently possible to have privateBrowserId > 1.
Comment 41•8 years ago
|
||
I still think we have an IPC problem here. (0,2) pairs have also crashed the browser.
Comment 42•8 years ago
|
||
For now, we have 0 reports on Beta 50.0b6. I wonder if the signature has changed.
Comment 43•8 years ago
|
||
Backing out 1279568 will solve the (0,5) mismatches. We should back that out of Beta now. But I don't know why we'd be seeing (0,1-4) mismatches. Containers isn't enabled on Beta. So users would have to explicitly go into about:config and enable them.
Test Pilot for Containers will probably launch in Firefox 50 though. In which case, this could become a Test Pilot crash issue.
We need to figure out why we have the mismatch in the first place.
Comment 44•8 years ago
|
||
For Bug 1279568, instead of backing out, maybe we just need to make it conditional on the privacy.usercontext.enabled pref.
Filed bug 1309699 to track the back out mentioned in comment 39 and 44.
Depends on: 1309699
Comment 46•8 years ago
|
||
Is the crash only an issue for debug builds? Talking to baku earlier, it sounded like it was an assertion failure that would only effect debug builds. But given the number of crashes, it must be a non-debug issue too.
Comment 47•8 years ago
|
||
(In reply to Tanvi Vyas [:tanvi] from comment #46)
> Is the crash only an issue for debug builds? Talking to baku earlier, it
> sounded like it was an assertion failure that would only effect debug
> builds. But given the number of crashes, it must be a non-debug issue too.
No, this was a MOZ_CRASH which turned into a manual crash in Valentin's debugging patch. See <http://searchfox.org/mozilla-central/source/netwerk/ipc/NeckoParent.cpp#185>
Assignee | ||
Comment 48•8 years ago
|
||
This manual assertion was introduced by this patch. Do we want to keep it or it's just for debugging reasons?
Comment 49•8 years ago
|
||
(In reply to Andrea Marchesini [:baku] from comment #48)
> This manual assertion was introduced by this patch. Do we want to keep it or
> it's just for debugging reasons?
No, it was there before.
Before the debug patch, we'd return the string here: <https://hg.mozilla.org/mozilla-central/rev/871153c1b8b2#l1.91> which would have been returned from <http://searchfox.org/mozilla-central/rev/2142de26c16c05f23e543be4fa1a651c4d29604e/netwerk/ipc/NeckoParent.cpp#212> which would cause us to return null from the AllocPHttpChannelParent function <http://searchfox.org/mozilla-central/rev/2142de26c16c05f23e543be4fa1a651c4d29604e/netwerk/ipc/NeckoParent.cpp#263> which would cause us to return MsgValueError from NeckoParent::OnMessageReceived() which would flow through the IPC machinery to get to mozilla::ipc::FatalError <http://searchfox.org/mozilla-central/rev/d96317a351af8aa78ab9847e7feed964bbaac7d7/ipc/glue/ProtocolUtils.cpp#292> which would MOZ_CRASH.
Comment 50•8 years ago
|
||
Ritu, did 1309699 reduce the number of beta crashers?
Valentin or baku, any progress on finding the root cause?
Flags: needinfo?(rkothari)
Comment 51•8 years ago
|
||
(In reply to Tanvi Vyas [:tanvi] from comment #50)
> Ritu, did 1309699 reduce the number of beta crashers?
curiously the content crashes with this signature already ceased in 50.0b6, whereas the backout from bug 1309699 should have just happened in beta 7.
https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_50_0b5_RELEASE&tochange=FIREFOX_50_0b6_RELEASE
Flags: needinfo?(rkothari)
Comment 53•8 years ago
|
||
(In reply to [:philipp] from comment #51)
> (In reply to Tanvi Vyas [:tanvi] from comment #50)
> > Ritu, did 1309699 reduce the number of beta crashers?
>
> curiously the content crashes with this signature already ceased in 50.0b6,
> whereas the backout from bug 1309699 should have just happened in beta 7.
> https://hg.mozilla.org/releases/mozilla-beta/
> pushloghtml?fromchange=FIREFOX_50_0b5_RELEASE&tochange=FIREFOX_50_0b6_RELEASE
I think some IPDL changes made the signature change.
Crash Signature: [@ IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b]
[@ mozilla::net::CrashWithReason ]
[@ mozilla::net::NeckoParent::GetValidatedAppInfo ] → [@ IPCError-browser | (msgtype=0xAE0005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b]
[@ IPCError-browser | (msgtype=0xB00005,name=PNecko::Msg_PHttpChannelConstructor) Value error: message was deserialized, b]
[@ mo…
Comment 54•8 years ago
|
||
(In reply to Valentin Gosu [:valentin] from comment #53)
> (In reply to [:philipp] from comment #51)
> > (In reply to Tanvi Vyas [:tanvi] from comment #50)
> > > Ritu, did 1309699 reduce the number of beta crashers?
> >
> > curiously the content crashes with this signature already ceased in 50.0b6,
> > whereas the backout from bug 1309699 should have just happened in beta 7.
> > https://hg.mozilla.org/releases/mozilla-beta/
> > pushloghtml?fromchange=FIREFOX_50_0b5_RELEASE&tochange=FIREFOX_50_0b6_RELEASE
>
> I think some IPDL changes made the signature change.
Actually, I don't see any more crash reports for the other signature either. If it wasn't the backout that did it, it might have been bug 1277681. Or maybe the signature has changed in another odd way.
Comment 55•8 years ago
|
||
The root cause of this bug has still not been identified. Backing out about:newtab Cookie segregation in https://bugzilla.mozilla.org/show_bug.cgi?id=1309699 helped reduce the number of crashes we see, but the underlying problem still exists.
We don't have a reproducible test case, which makes finding the issue very difficult. From what I understand, it has to do with a race condition when opening or closing a tab.
If we don't fix this in Firefox 50 Beta, the Test Pilot experiment for Containers (which will go into release 50 at the end of Q4) is at risk. This is becasue the experiment will enable Containers and hence make this race condition more prevalent again (the way about:newtab did).
Priority: -- → P1
Whiteboard: [necko-active] → [necko-active][OA]
Comment 56•8 years ago
|
||
I've just noticed that I can reproduce this bug pretty consistently (the 'GetValidatedAppInfo | App does not have permission -(0,2)(0,2)' case) by simply opening a PDF (pdf.js) in a container tab.
Comment 57•8 years ago
|
||
(In reply to Valentin Gosu [:valentin] from comment #54)
Note that even though there are no more crashes on beta, Nightly is still hitting the debug code.
Comment 58•8 years ago
|
||
(In reply to Marco Castelluccio [:marco] from comment #56)
> I've just noticed that I can reproduce this bug pretty consistently (the
> 'GetValidatedAppInfo | App does not have permission -(0,2)(0,2)' case) by
> simply opening a PDF (pdf.js) in a container tab.
Can you specify the STR? I wasn't able to reproduce the crash with any of the PDF files I tried, on either Linux or Windows.
Flags: needinfo?(mcastelluccio)
Comment 59•8 years ago
|
||
I can't reproduce it as reliably as yesterday, not sure why.
I was able to reproduce by simply searching for a PDF on Google and opening it in another tab with a middle click.
Flags: needinfo?(mcastelluccio)
Comment 60•8 years ago
|
||
(In reply to Marco Castelluccio [:marco] from comment #59)
> I can't reproduce it as reliably as yesterday, not sure why.
> I was able to reproduce by simply searching for a PDF on Google and opening
> it in another tab with a middle click.
Exactly what I experienced with comment 2. Maybe some cache issues ?
Comment 61•8 years ago
|
||
I've managed to reproduce this crash pretty consistently (6+ times)!! Hopefully this is what we're looking for... I've added as much information as possible.
Process 45705 stopped
* thread #1: tid = 0x471ccd, 0x0000000101e0fd2e XUL`mozilla::net::CrashWithReason(reason="GetValidatedAppInfo | App does not have permission -(0,5)") + 25 at NeckoParent.cpp:108, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
frame #0: 0x0000000101e0fd2e XUL`mozilla::net::CrashWithReason(reason="GetValidatedAppInfo | App does not have permission -(0,5)") + 25 at NeckoParent.cpp:108
105 {
106 #ifndef RELEASE_OR_BETA
107 MOZ_CRASH_ANNOTATE(reason);
-> 108 MOZ_REALLY_CRASH();
109 #endif
110 }
111
Example of the STR being used to reproduce the crash:
* https://youtu.be/KDSdHoUm6Vk (attached a crash stack as well)
Perquisites:
* You'll need to create a slack team via https://slack.com/create (it should only take a few seconds to get one up and running)
* upload several PDF's files into one of the slack channels (I used #general in my testing)
STR: (see the above video for an example)
* open the latest version of m-c using ./mach run --debug
* login into the slack team that you've created earlier in a container (I used the personal container)
* click on one of the PDF's that you've uploaded, while the PDF is opening, quickly open a new tab via the keyboard shortcut and close both the PDF that's being loaded and the new tab that you've just created (this needs to be done very quickly)
* repeating the steps will eventually cause the above crash (sometimes it's instant and sometimes it takes 10-20 times)
Once you get the above crash using the above STR, you'll be able to reproduce the crash once again via:
* once the crash occurs, close m-c via the terminal (lldb exit)
* re-open the same browser via ./mach run --debug
* open a new tab and close about:sessionrestore (don't restore)
* leave the new tab opened for a minute or so and the browser will crash
I'm assuming it's trying to load a thumbnail from the previous session which causes the crash without the user doing anything. This might explain why the users are seeing the crash happen a few times in a row and then not being able to reproduce.
Comment 62•8 years ago
|
||
Assignee | ||
Comment 63•8 years ago
|
||
I think I know what the problem is. Here what I have found so far:
1. When we load a PDF, PDF.js does the main http request using the originAttribute of the current window/tab. This means that, if the tab is a container tab, OriginAttributes will be correctly set in the HTTP request.
2. But PDFs are complex things and sometimes, also based on what the server supports, we need to do extra requests. Those are called "HTTP range requests". For how PDF.js is written, those additional requests are done as XHR, in a sandbox, directly by the pdfjs.jsm. So, they run as SystemPrincipal.
3. SystemPrincipal has a 'default' OriginAttributes (with 0 values for each key - mUserContextId is 0, mPrivateBrowsing is 0, etc...).
When HttpChannelChild does: SendPHttpChannelConstructor(..., IPC::SerializedLoadContext(this)...) we serialize this 'default' OriginAttributes.
4. The parent process receives the HttpChannelParentConstructor call. It deserialize the LoadContext and it finds an OriginAttributes with 0 mUserContextId, 0 mPrivateBrowsing, - the 'default' OriginAttributes. Then, it compares this OA with any existing tabContext. Often we have a 'normal' tab opened, so everything is fine -luckily-. But if only container tabs are opened and/or only about:newtabs (because of the userContextId 5 used for thumbnails), we crash.
This seems a reasonable explanation of what happens.
Now, the fix for this issue must be done correctly, and first, I need a feedback from Valentin, or some other necko guy because, it seems to me that we don't support httpChannel actors if they are created by JSM (or anything else using systemPrincipal) in the child process: the comparison is based on the OA of the tab only.
Flags: needinfo?(valentin.gosu)
Assignee | ||
Comment 64•8 years ago
|
||
An easy fix would be to add the full principal instead just OriginAttributes in the serialization of loadContext.
Then the OA comparison is done only if the principal is not the systemPrincipal. But this makes our security model weaker :/
Long term plan is to modify pdf.js in order to use the content Principal everywhere.
Comment 65•8 years ago
|
||
(In reply to Andrea Marchesini [:baku] from comment #64)
> An easy fix would be to add the full principal instead just OriginAttributes
> in the serialization of loadContext.
> Then the OA comparison is done only if the principal is not the
> systemPrincipal. But this makes our security model weaker :/
> Long term plan is to modify pdf.js in order to use the content Principal
> everywhere.
I think we should do this at least for Firefox 50. Then for 51+ we can try the full fix and update the pdf code.
Assignee | ||
Comment 66•8 years ago
|
||
Attachment #8805755 -
Flags: review?(valentin.gosu)
Comment 67•8 years ago
|
||
Comment on attachment 8805755 [details] [diff] [review]
crash_necko.patch
Review of attachment 8805755 [details] [diff] [review]:
-----------------------------------------------------------------
Thanks for digging into this.
::: netwerk/ipc/NeckoParent.cpp
@@ +131,5 @@
> + // If serialized is null, we cannot validate anything. We have to assume that
> + // this requests comes from a SystemPrincipal.
> + if (!aSerialized.IsNotNull()) {
> + return nullptr;
> + }
Shouldn't you also update aAttrs with the Docshell origin attributes? See the end of this method.
I don't know if we need to worry about lowering security here UsingNeckoIPCSecurity() returns false anyway. :)
Attachment #8805755 -
Flags: review?(valentin.gosu) → review-
Assignee | ||
Comment 68•8 years ago
|
||
Attachment #8805755 -
Attachment is obsolete: true
Attachment #8805792 -
Flags: review?(valentin.gosu)
Comment 69•8 years ago
|
||
Comment on attachment 8805792 [details] [diff] [review]
Part 1 - IsNotNull() check
Review of attachment 8805792 [details] [diff] [review]:
-----------------------------------------------------------------
Looks great. Thanks!
Please nominate for uplifting after landing.
Attachment #8805792 -
Flags: review?(valentin.gosu) → review+
Comment 70•8 years ago
|
||
(In reply to Valentin Gosu [:valentin] from comment #69)
> Looks great. Thanks!
> Please nominate for uplifting after landing.
Also I'm not sure this fixes the container bug, as we still fall through when !contextArray.IsEmpty() and aSerialized.IsNotNull()
Flags: needinfo?(valentin.gosu) → needinfo?(amarchesini)
Comment 71•8 years ago
|
||
Baku, let me know if you need any help testing this patch.
Assignee | ||
Comment 72•8 years ago
|
||
(In reply to Kamil Jozwiak [:kjozwiak] from comment #71)
> Baku, let me know if you need any help testing this patch.
I think this patch fixes the crash, but, as you know, it's hard to reproduce the issue. If you want to spend some times on monday, or when you want, this would be great.
Flags: needinfo?(amarchesini)
Assignee | ||
Updated•8 years ago
|
Assignee: valentin.gosu → amarchesini
Comment 73•8 years ago
|
||
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4a6cd6dff941
Security check in NeckoParent::GetValidatedAppInfo must pass if the serialized LoadContext is null, r=valentin
Assignee | ||
Comment 74•8 years ago
|
||
Comment on attachment 8805792 [details] [diff] [review]
Part 1 - IsNotNull() check
Approval Request Comment
[Feature/regressing bug #]: IPC necko http actors
[User impact if declined]: The security check is wrongly done
[Describe test coverage new/current, TreeHerder]: super racy. a test cannot be written.
[Risks and why]: The patch is super simple: the check !aSerialized.IsNotNull() was missing in NeckoParent::GetValidatedAppInfo. The patch looks bigger because I moved some if() statements around.
[String/UUID change made/needed]: none
Attachment #8805792 -
Flags: approval-mozilla-beta?
Attachment #8805792 -
Flags: approval-mozilla-aurora?
Assignee | ||
Comment 75•8 years ago
|
||
> Also I'm not sure this fixes the container bug, as we still fall through
> when !contextArray.IsEmpty() and aSerialized.IsNotNull()
In that case, I guess we have 2 separate bugs. What you are describing is not a JSM loading some resources in the child process, bug a bug in how we propagate userContextId.
Assignee | ||
Comment 76•8 years ago
|
||
I didn't remove the debugString. We can see if we trigger this issue again also with this patch.
Comment 77•8 years ago
|
||
bugherder |
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Comment 78•8 years ago
|
||
I haven't been able to reproduce the crash for now with today's build (I was able to reproduce it consistently with yesterday's build on Slack). I'll report back if I manage to reproduce it again.
I was looking at Nightly52 crash data and the last signature is still visible on 10/31 builds. Perhaps this fix doesn't work for all attached signatures.
Comment on attachment 8805792 [details] [diff] [review]
Part 1 - IsNotNull() check
Let's uplift to Aurora51 to stabilize. We need a bit more data before deciding to uplift this in 50 RC builds.
Attachment #8805792 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 81•8 years ago
|
||
(In reply to Andrea Marchesini [:baku] from comment #76)
> I didn't remove the debugString. We can see if we trigger this issue again
> also with this patch.
(In reply to Ritu Kothari (:ritu) from comment #79)
> I was looking at Nightly52 crash data and the last signature is still
> visible on 10/31 builds. Perhaps this fix doesn't work for all attached
> signatures.
The null serialized load info was only a small part of this. It was only after I reviewed the patch that I realized it still doesn't fix the main reason for the mismatch: when the child sends userContextId:5 and all the open windows have userContextId:0, or the opposite (0,5).
Comment 82•8 years ago
|
||
I'm going to reopen since it sounds like only part of the problem is fixed.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 83•8 years ago
|
||
I've hit this again, but in a different scenario: bp-06928b3c-1e2d-40c5-866b-564522161101.
I'm not sure what caused it. I didn't have any container tab open.
Assignee | ||
Comment 84•8 years ago
|
||
str |
Good news: I know why we crash. Debugging this issue with Marco (thanks a lot btw) we found why this issue happens. Here the STR:
1. Open in a container tab https://gauntface.github.io/simple-push-demo/
2. enable notification
3. copy the CURL command
4. open a normal tab and close any other container tab.
5. wait 5-10 seconds
6. execute the command - BOOM.
The reason of the crash is this: the ServiceWorker of the simple-push-demo is executed when a push notification is received. This SW runs with the correct UCI but we don't have any container tab of that kind. So we crash.
Comment 85•8 years ago
|
||
Woohoo \o/!! Awesome work Baku/Marco :)
The STR are a little different under macOS 10.12.1
* Open a container tab and load https://gauntface.github.io/simple-push-demo/
* enable notifications via the sliding menu
** make sure there's an entry saved under notifications via about:preferences#content
* copy the CURL command from the container tab where notifications have been enabled
* open a new regular tab and load the same website (https://gauntface.github.io/simple-push-demo/)
** notifications shouldn't be enabled in this tab
* close the original container tab where you've enabled notifications so only the default tab is opened
* launch a terminal and run the CURL command that you copied earlier
You'll get the following crash instantly:
Process 65407 stopped
* thread #1: tid = 0xad6881, 0x000000010260356c XUL`mozilla::net::CrashWithReason(reason="GetValidatedAppInfo | App does not have permission -(1,0)") + 25 at NeckoParent.cpp:105, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
frame #0: 0x000000010260356c XUL`mozilla::net::CrashWithReason(reason="GetValidatedAppInfo | App does not have permission -(1,0)") + 25 at NeckoParent.cpp:105
102 {
103 #ifndef RELEASE_OR_BETA
104 MOZ_CRASH_ANNOTATE(reason);
-> 105 MOZ_REALLY_CRASH();
106 #endif
107 }
108
Assignee | ||
Comment 86•8 years ago
|
||
Attachment #8806378 -
Flags: review?(valentin.gosu)
Attachment #8806378 -
Flags: review?(bugmail)
Comment 87•8 years ago
|
||
Comment on attachment 8806378 [details] [diff] [review]
part 2 - SW+Push
Review of attachment 8806378 [details] [diff] [review]:
-----------------------------------------------------------------
With the context of the bug building on our IRC conversation, I think I have an improvement. Please see the NeckoParent.cpp comment.
I'm shifting r? to :bkelly since although you (:baku) can defer reviews to me for this, I think any ServiceWorkerManager change still wants his awareness/sign-off, but I'm happy to do any follow-up review steps if :bkelly is cool with it in order to help reduce his review burden and/or because this is the area I'm messing with for remoting ServiceWorkerPrivate.
:bkelly, the quick context dump is that NeckoParent does a security check in GetValidatedAppInfo to make sure the request from the content process is legit and it only knows about TabContexts as tracked by the ContentProcessManager, which of course is entirely unaware of what ServiceWorkers get up to. In this case the push service has decided to spin up a ServiceWorker in response to a push notification in the content process (which it does itself from within the content process). Glossing a little bit, because the TabContext origin attribute checks are very granular, as long as there is a TabContext with the same (container) mUserContextId, things are cool so usually everything works out. But when someone is using the non-default container and a service worker gets involved, the potential for crashes goes way up since you're less likely to luck into a TabContext with the same mUserContextId.
::: netwerk/ipc/NeckoParent.cpp
@@ +224,5 @@
>
> return nullptr;
> }
>
> + // Maybe this can be a ServiceWorker
It looks like this check occurs too late. Specifically, I think the ServiceWorker checks wants to occur prior to the `if (contextArray.IsEmpty()) {` check which can induce a crash.
This matters if a ContentParent can exist without having any TabContexts. Since in this very case the existence of the ServiceWorker is decoupled from the existence of a TabContext, it seems appropriate.
Having said that, I think right now the ContentParent/ContentProcessManager may kill the process once the number of TabContexts goes to zero, *but*:
- I think :gabor may be changing that for efficiency in the nProcesses > 1 case (if he or :mrkbap haven't already).
- My changes for ServiceWorkerPrivate remoting currently call for a dedicated content process to be used which is not tracked by the ContentProcessManager. This means the GetManagedTabContext() call will assert in ContentProcessManager::GetTabContextByContentProcess() and it'll need to be changed.
So in addition to moving the check up, I suggest we change
HasRegistrations(nsIPrincipal* aPrincipal);
to instead be:
MayHaveActiveServiceWorkerInstance(ContentParent* aContent, nsIPrincipal* aPrincipal);
The comment on MayHaveActiveServiceWorkerInstance would be something like:
"""
Return true if the given content process could potentially be executing service worker code with the given principal. At the current time, this just means that we have any registration for the origin, regardless of scope. This is a very weak guarantee but is the best we can do when push notifications can currently spin up a service worker in content processes without our involvement in the parent process.
In the future when there is only a single ServiceWorkerManager in the parent process that is entirely in control of spawning and running service worker code, we will be able to authoritatively indicate whether there is an activate service worker in the given content process. At that time we will rename this method HasActiveServiceWorkerInstance and provide semantics that ensure this method returns true until the worker is known to have shut down in order to allow the caller to induce a crash for security reasons without having to worry about shutdown races with the worker.
"""
Attachment #8806378 -
Flags: review?(bugmail) → review?(bkelly)
Assignee | ||
Comment 88•8 years ago
|
||
Attachment #8806453 -
Flags: review?(valentin.gosu)
Assignee | ||
Comment 89•8 years ago
|
||
Andrew, thanks for you comments!
Attachment #8806378 -
Attachment is obsolete: true
Attachment #8806378 -
Flags: review?(valentin.gosu)
Attachment #8806378 -
Flags: review?(bkelly)
Attachment #8806455 -
Flags: review?(bkelly)
Comment 90•8 years ago
|
||
Using the STR from comment#85, you can reproduce this under Win 10 Pro x64 as well.
Comment 91•8 years ago
|
||
Comment on attachment 8806453 [details] [diff] [review]
part 3 - test
Review of attachment 8806453 [details] [diff] [review]:
-----------------------------------------------------------------
I <3 tests!
Attachment #8806453 -
Flags: review?(valentin.gosu) → review+
Comment 92•8 years ago
|
||
Comment on attachment 8806455 [details] [diff] [review]
part 2 - SW+Push
Review of attachment 8806455 [details] [diff] [review]:
-----------------------------------------------------------------
::: netwerk/ipc/NeckoParent.cpp
@@ +98,5 @@
> return kPBOverride_Unset;
> }
>
> +static already_AddRefed<nsIPrincipal>
> +GetRequestingPrincipalFromOptionalLoadInfoArgs(const OptionalLoadInfoArgs aOptionalLoadInfoArgs)
Maybe pass the arg as a reference?
And how about a shorter name? GetRequestingPrincipal should be clear enough.
@@ +214,5 @@
>
> return nullptr;
> }
>
> + // Maybe this can be a ServiceWorker
It would be nice if this explained why it is necessary to check for active service workers.
Assignee | ||
Comment 93•8 years ago
|
||
Here the comment and the shorter name. I cannot pass a reference of nsIPrincipal because PrincipalInfoToPrincipal allocates a new nsIPrincipal and I want to keep this code safe.
Attachment #8805792 -
Attachment is obsolete: true
Attachment #8806455 -
Attachment is obsolete: true
Attachment #8805792 -
Flags: approval-mozilla-beta?
Attachment #8806455 -
Flags: review?(bkelly)
Attachment #8806604 -
Flags: review?(valentin.gosu)
Attachment #8806604 -
Flags: review?(bkelly)
Assignee | ||
Comment 94•8 years ago
|
||
Comment on attachment 8805792 [details] [diff] [review]
Part 1 - IsNotNull() check
We still need this patch in beta.
Attachment #8805792 -
Attachment description: crash_necko.patch → Part 1 - IsNotNull() check
Attachment #8805792 -
Attachment is obsolete: false
Attachment #8805792 -
Flags: approval-mozilla-beta?
Comment 95•8 years ago
|
||
Comment on attachment 8806604 [details] [diff] [review]
part 2 - SW+Push
Review of attachment 8806604 [details] [diff] [review]:
-----------------------------------------------------------------
r=me on the Necko bits. Thanks!
Attachment #8806604 -
Flags: review?(valentin.gosu) → review+
Comment 96•8 years ago
|
||
Comment on attachment 8806604 [details] [diff] [review]
part 2 - SW+Push
Review of attachment 8806604 [details] [diff] [review]:
-----------------------------------------------------------------
I spoke with Andrea about this on IRC. I think this is a bandaid because we don't really know if the content process is running the service worker. Its a good step for now, though, and is worth doing to fix the crash in beta.
Please add a comment and follow-up bug that this check should be removed or improved once we move to our new e10s service worker model.
Attachment #8806604 -
Flags: review?(bkelly) → review+
Comment on attachment 8805792 [details] [diff] [review]
Part 1 - IsNotNull() check
I don't think the entire fix is ready yet, clearing the beta nom.
Attachment #8805792 -
Flags: approval-mozilla-beta?
Comment 98•8 years ago
|
||
(In reply to Ben Kelly [:bkelly] from comment #96)
> Comment on attachment 8806604 [details] [diff] [review]
> part 2 - SW+Push
>
> Review of attachment 8806604 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> I spoke with Andrea about this on IRC. I think this is a bandaid because we
> don't really know if the content process is running the service worker. Its
> a good step for now, though, and is worth doing to fix the crash in beta.
>
> Please add a comment and follow-up bug that this check should be removed or
> improved once we move to our new e10s service worker model.
Hi Ben,
Are you saying the bandaid may not fix the crash? Or the bandaid will fix the crash, but there is more followup work to do here.
Comment 99•8 years ago
|
||
(In reply to Ritu Kothari (:ritu) from comment #97)
> Comment on attachment 8805792 [details] [diff] [review]
> Part 1 - IsNotNull() check
>
> I don't think the entire fix is ready yet, clearing the beta nom.
Once baku lands this in inbound, we can re-nominate all the necessary patches for beta. I'm hoping that will be tomorrow.
Comment 100•8 years ago
|
||
(In reply to Tanvi Vyas [:tanvi] from comment #98)
> Are you saying the bandaid may not fix the crash? Or the bandaid will fix
> the crash, but there is more followup work to do here.
I'm saying its not actually providing any anti-spoofing protection like our TabContext. We are basically just allowing it through anyway if we have a SW registration matching the principal. We have no positive confirmation that SW is running in the child process creating the channel.
Long term we will have a data structure in the parent process that knows where service workers should be running. At that point we can do a better job of preventing someone from spoofing a child process service worker.
Comment 101•8 years ago
|
||
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0473f2d0d024
NeckoParent::GetValidatedAppInfo should consider ServiceWorkers when validating HttpChannel requests, r=bkelly, r=valentin, f=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/40e351ced11b
Test about Containers, SW and PushNotifications, r=valentin
Comment 102•8 years ago
|
||
Backed out for eslint failure (missing globals):
https://hg.mozilla.org/integration/mozilla-inbound/rev/439965937366d0c2a504f8ca8cab1ed07be24fe7
https://hg.mozilla.org/integration/mozilla-inbound/rev/0bf3ab8a2a94a1d1b8299e1e9395f0361b1b7dba
Push with failure: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=40e351ced11b25aa0bdb8bd572e6e68ba1609c18
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=38582322&repo=mozilla-inbound
[task 2016-11-02T21:03:31.378029Z] TEST-UNEXPECTED-ERROR | /home/worker/checkouts/gecko/browser/components/contextualidentity/test/browser/browser_pushnotification.js:44:26 | 'ChromeUtils' is not defined. (no-undef)
[task 2016-11-02T21:03:31.378102Z] TEST-UNEXPECTED-ERROR | /home/worker/checkouts/gecko/browser/components/contextualidentity/test/browser/pushworker.js:1:1 | 'onpush' is not defined. (no-undef)
Flags: needinfo?(amarchesini)
Comment 103•8 years ago
|
||
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a4ee836c49a3
NeckoParent::GetValidatedAppInfo should consider ServiceWorkers when validating HttpChannel requests, r=bkelly, r=valentin, f=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/8a072e37f540
Test about Containers, SW and PushNotifications, r=valentin
Assignee | ||
Comment 104•8 years ago
|
||
Comment on attachment 8805792 [details] [diff] [review]
Part 1 - IsNotNull() check
Ritu, we really need this for beta. As we need the following patch.
Together the fix the issue completely.
The reason why need them is because otherwise we could crash in this scenario: on mac, FF is running but without tabs/windows. A push notification is received. We crash because this necko check fails.
Flags: needinfo?(amarchesini)
Attachment #8805792 -
Flags: approval-mozilla-beta?
Assignee | ||
Comment 105•8 years ago
|
||
Comment on attachment 8806604 [details] [diff] [review]
part 2 - SW+Push
See previous comments
Attachment #8806604 -
Flags: approval-mozilla-beta?
Attachment #8806604 -
Flags: approval-mozilla-aurora?
Assignee | ||
Comment 106•8 years ago
|
||
Note that these 2 patches do not fix a container problem. It's more generic than this because it can happen (I'm not 100% sure yet) in mac when no windows/tabs are opened and a push notification is received.
Comment 107•8 years ago
|
||
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5ec89b4b058e
Backed out changeset 8a072e37f540 for failures in browser_pushnotification.js
Comment 108•8 years ago
|
||
str |
(In reply to Andrea Marchesini [:baku] from comment #106)
> Note that these 2 patches do not fix a container problem. It's more generic
> than this because it can happen (I'm not 100% sure yet) in mac when no
> windows/tabs are opened and a push notification is received.
Steps to reproduce:
1. Open in a tab https://gauntface.github.io/simple-push-demo/
2. Enable notifications
3. Copy the cURL command
4. Close all tabs (but make sure Firefox is still running)
5. Wait 5-10 seconds
6. Execute the cURL command
Marcia was able to reproduce this crash with Nightly, but not with Beta.
https://crash-stats.mozilla.com/report/index/dee1c77c-792f-4457-b61c-1d4a52161103
Andi tried with Nightly and was able to crash when he had a tab open (without any tab open Firefox didn't crash but didn't receive the notification either).
https://crash-stats.mozilla.com/report/index/b65d67df-fa3d-4973-b567-a6bba2161103
With the patch applied he wasn't able to reproduce the crash (but didn't receive the notification either, he only received it if https://gauntface.github.io/simple-push-demo/ was open).
Comment 109•8 years ago
|
||
I went through the same STR from comment#108 using the latest debug beta and couldn't reproduce the issue either.
Build Used:
* fx50.0b12, buildId: 20161103105945, changeset: a0a503f568d8
Results:
* https://pastebin.mozilla.org/8924554
Comment 110•8 years ago
|
||
Comment on attachment 8806604 [details] [diff] [review]
part 2 - SW+Push
Let's stabilize on 51 before we decide whether to uplift to Beta50 or not.
Attachment #8806604 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 111•8 years ago
|
||
It looks like betas after b6 aren't affected (with the exception of 5 crash reports in Beta 9), so it might be a container-specific bug.
https://crash-stats.mozilla.com/search/?signature=~PNecko%3A%3AMsg_PHttpChannelConstructor&signature=%3Dmozilla%3A%3Anet%3A%3ACrashWithReason&signature=%3Dmozilla%3A%3Anet%3A%3ANeckoParent%3A%3AGetValidatedAppInfo&product=Firefox&version=50.0b&date=%3E%3D2016-10-27T18%3A59%3A00.000Z&date=%3C2016-11-03T18%3A59%3A00.000Z&_sort=-date&_facets=signature&_facets=version&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-version
Comment 112•8 years ago
|
||
Hi Andrei, could you please have your team repro this issue on Beta50 (see comment 108)? Most folks are unable to repro on Beta50 which make me feel this is not a must uplift to RC2. Thank you!
Flags: needinfo?(andrei.vaida)
Comment 113•8 years ago
|
||
bugherder |
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
Comment 114•8 years ago
|
||
I've tested this on 50b6, 50b11, 52.0a2 (2016-11-04) on OS X 10.11 and no crashes occurred.
50b6 - without any tab open (Fx was still running in the background), Firefox didn't crash and receive the notification. Same result with 1 tab open.
50b11, 52.0a2 (2016-11-04) - without any tab open, Firefox didn't crash but didn't receive the notification either. With 1 tab open, Firefox didn't crash and receive the notification.
Flags: needinfo?(andrei.vaida)
Assignee | ||
Comment 115•8 years ago
|
||
Another reason why it _could_ crash, is with signedPkg. But I don't know enough about this signed packages to know if this is an issue here.
Flags: needinfo?(tanvi)
Comment 116•8 years ago
|
||
(In reply to Andrea Marchesini [:baku] from comment #115)
> Another reason why it _could_ crash, is with signedPkg. But I don't know
> enough about this signed packages to know if this is an issue here.
signedPkg was implemented for New Security Model in Firefox OS.
I think this attribute is no longer needed. Maybe we should remove it at some point of time.
Comment 117•8 years ago
|
||
(In reply to Andrea Marchesini [:baku] from comment #115)
> Another reason why it _could_ crash, is with signedPkg. But I don't know
> enough about this signed packages to know if this is an issue here.
I don't know much about this either. Is signedPkg still used for anything?
Flags: needinfo?(tanvi)
Flags: needinfo?(ptheriault)
Flags: needinfo?(dveditz)
Flags: needinfo?(allstars.chh)
Comment 118•8 years ago
|
||
We still don't know that this crash won't occur in Firefox 50 release, particularly with the Test Pilot experiment for Containers. I'm not sure why the STR aren't working in beta7, but the STR were very hard to track down in the first place. The technical issue that was patched for 51 and 52 exists in 50 as well, even though we can't seem to reproduce the crash.
Comment 119•8 years ago
|
||
Paul, have you tested both with e10s turned off and turned on?
Flags: needinfo?(paul.silaghi)
Comment 120•8 years ago
|
||
Wes or tomcat, did this land in aurora? Thanks.
Flags: needinfo?(wkocher)
Flags: needinfo?(cbook)
Comment 121•8 years ago
|
||
bugherder uplift |
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #120)
> Wes or tomcat, did this land in aurora? Thanks.
It did now.
Flags: needinfo?(wkocher)
Flags: needinfo?(cbook)
But I believe it caused bustage like https://treeherder.mozilla.org/logviewer.html#?job_id=4067687&repo=mozilla-aurora so I just backed it out:
https://hg.mozilla.org/releases/mozilla-aurora/rev/dbcc5abc7d34
Flags: needinfo?(amarchesini)
Comment 124•8 years ago
|
||
This late in 50 cycle, I am unable to justify uplifting a patch for a TestPilot project. We should let this fix stabilize on Beta51 for a few weeks and ship with 50.1.0 if deemed safe.
Flags: needinfo?(allstars.chh)
Comment 125•8 years ago
|
||
(In reply to Marco Castelluccio [:marco] from comment #119)
> Paul, have you tested both with e10s turned off and turned on?
Just with e10s=ON.
Tried again, with e10s=OFF:
50b6, 50b11, 52.0a2 (2016-11-04) - without any tab open (Fx was still running in the background), Firefox didn't crash and receive the notification. Same result with 1 tab open.
Flags: needinfo?(paul.silaghi)
Assignee | ||
Comment 126•8 years ago
|
||
Bug 1301406 must be uplifted too otherwise these 2 patches do not apply to m-a.
Flags: needinfo?(amarchesini)
Comment 127•8 years ago
|
||
(In reply to Tanvi Vyas [:tanvi] from comment #117)
> Is signedPkg still used for anything?
Doesn't look like it. It should go when the appID stuff does.
Updated•8 years ago
|
Flags: needinfo?(dveditz)
Comment 128•8 years ago
|
||
(In reply to Tanvi Vyas [:tanvi] from comment #117)
> I don't know much about this either. Is signedPkg still used for anything?
Valentin already removed signedPkg from Origin Attributes in bug 1315302.
Comment 129•8 years ago
|
||
has problems to uplift like applying crash_necko.patch
patching file netwerk/ipc/NeckoParent.cpp
Hunk #1 FAILED at 115
Hunk #2 FAILED at 167
2 out of 2 hunks FAILED -- saving rejects to file netwerk/ipc/NeckoParent.cpp.rej
patch failed, unable to continue (try -v)
patch failed, rejects left in working directory
errors during apply, please fix and qrefresh crash_necko.patch
Flags: needinfo?(amarchesini)
Comment 130•8 years ago
|
||
Note that we're still seeing crashes on Nightly (https://crash-stats.mozilla.com/search/?signature=%3Dmozilla%3A%3Anet%3A%3ANeckoParent%3A%3AGetValidatedAppInfo&signature=%3Dmozilla%3A%3Anet%3A%3ACrashWithReason&product=Firefox&date=%3E%3D2016-11-02T14%3A34%3A00.000Z&date=%3C2016-11-09T14%3A34%3A00.000Z&_sort=-date&_facets=signature&_facets=build_id&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-build_id), so there's likely at least a third possible crash other than the pdf.js one and the web push one.
Comment 132•8 years ago
|
||
(In reply to Andrea Marchesini [:baku] from comment #131)
> Tomcat, see comment 126.
ok uplifting now bug 1301406 but there seems to be still problems here for this patch:
373003:a4ee836c49a3 "Bug 1289001 - NeckoParent::GetValidatedAppInfo should consider ServiceWorkers when validating HttpChannel requests, r=bkelly, r=valentin, f=asuth"
merging dom/workers/ServiceWorkerManager.cpp
merging netwerk/ipc/NeckoParent.cpp
merging netwerk/ipc/NeckoParent.h
merging netwerk/protocol/wyciwyg/WyciwygChannelParent.cpp
warning: conflicts while merging netwerk/ipc/NeckoParent.cpp! (edit, then use 'hg resolve --mark')
abort: unresolved conflicts, can't continue
(use 'hg resolve' and 'hg graft --continue')
Assignee | ||
Comment 133•8 years ago
|
||
Flags: needinfo?(cbook)
Assignee | ||
Comment 134•8 years ago
|
||
Ok, I suspect what it's happening here. But I need a feedback from Bill. Between PBrowserChild::Send__delete__(mTabChild); and TabChild::ActorDestroy we run the event loop. So if an httpChannel is asking for IPCOpen(), we return true. But, in the parent process, we delete the tabContext when Recv__delete__ is received.
I guess the solution is to move mIPCOpen immediately before calling PBrowserChild::Send__delete__(mTabChild).
Flags: needinfo?(wmccloskey)
Assignee | ||
Comment 135•8 years ago
|
||
Flags: needinfo?(wmccloskey)
Attachment #8809723 -
Flags: review?(wmccloskey)
Updated•8 years ago
|
Flags: needinfo?(cbook)
Updated•8 years ago
|
Attachment #8809425 -
Attachment is patch: true
Comment 136•8 years ago
|
||
Untagging as blocking release, leaving it tracked for 50 in case we have a safe, well-tested fix ready for uplift before 50.1.0 release go-live.
Comment 137•8 years ago
|
||
Reopening since there's still patches landing here. We should be cognizant of the next Monday's uplift though and consider using a new bug for further work to avoid making this already-complicated bug more difficult to track.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla52 → ---
Updated•8 years ago
|
Attachment #8798043 -
Flags: checkin+
Updated•8 years ago
|
Attachment #8805792 -
Flags: checkin+
Updated•8 years ago
|
Attachment #8806453 -
Attachment is obsolete: true
Attachment #8806453 -
Flags: checkin-
Updated•8 years ago
|
Attachment #8806604 -
Flags: checkin+
Updated•8 years ago
|
Attachment #8809425 -
Flags: checkin+
Comment 138•8 years ago
|
||
bugherder uplift |
https://hg.mozilla.org/releases/mozilla-aurora/rev/b922ccc8628d
https://hg.mozilla.org/releases/mozilla-aurora/rev/f3bbd45dcc7a
https://hg.mozilla.org/releases/mozilla-aurora/rev/0116b93b47e9
Flags: in-testsuite?
Comment 139•8 years ago
|
||
Comment on attachment 8809723 [details] [diff] [review]
part 3 - tabChild
Review of attachment 8809723 [details] [diff] [review]:
-----------------------------------------------------------------
(Taking over the review at :baku's request after discussing it on IRC.)
::: dom/ipc/TabChild.h
@@ +748,5 @@
> {
> mUnscaledInnerSize = aSize;
> }
>
> + void Send__delete__();
Nit: it would be clearer if this method had a different name from the IPDL Send__delete__, maybe SendDeleteIfOpen or something like that.
Attachment #8809723 -
Flags: review?(wmccloskey) → review+
Comment 140•8 years ago
|
||
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/419fcafe1794
Fix a race condition in the use of TabChild::mIPCOpen, r=jld
Assignee | ||
Updated•8 years ago
|
Attachment #8809723 -
Attachment description: part 4 - tabChild → part 3 - tabChild
Attachment #8809723 -
Flags: approval-mozilla-beta?
Attachment #8809723 -
Flags: approval-mozilla-aurora?
(In reply to Andrea Marchesini [:baku] from comment #134)
> Ok, I suspect what it's happening here. But I need a feedback from Bill.
> Between PBrowserChild::Send__delete__(mTabChild); and TabChild::ActorDestroy
> we run the event loop.
Why do you think that? Looking at the code for Send__delete__:
http://searchfox.org/mozilla-central/source/__GENERATED__/ipc/ipdl/PBrowserChild.cpp#2040
It calls DestroySubtree, which calls ActorDestroy:
http://searchfox.org/mozilla-central/source/__GENERATED__/ipc/ipdl/PBrowserChild.cpp#5172
Flags: needinfo?(amarchesini)
Assignee | ||
Comment 142•8 years ago
|
||
Comment on attachment 8809723 [details] [diff] [review]
part 3 - tabChild
You are right. Tomcat, can you revert https://hg.mozilla.org/integration/mozilla-inbound/rev/419fcafe1794 ?
Attachment #8809723 -
Attachment is obsolete: true
Flags: needinfo?(amarchesini) → needinfo?(cbook)
Attachment #8809723 -
Flags: approval-mozilla-beta?
Attachment #8809723 -
Flags: approval-mozilla-aurora?
Flags: needinfo?(cbook) → needinfo?(amarchesini)
Flags: needinfo?(wmccloskey)
I had to back this out from aurora as well for debug xpcshell failures like https://treeherder.mozilla.org/logviewer.html#?job_id=4132020&repo=mozilla-aurora
https://hg.mozilla.org/releases/mozilla-aurora/rev/3d380055aaed
Assignee | ||
Updated•8 years ago
|
Flags: needinfo?(amarchesini)
Assignee | ||
Comment 146•8 years ago
|
||
Actually I don't see any new crash. All the crashes are coming from not-updated nightly. Marco, can you confirm this?
Flags: needinfo?(wmccloskey) → needinfo?(mcastelluccio)
Comment 147•8 years ago
|
||
We still have some crashes on Nightly:
https://crash-stats.mozilla.com/search/?signature=%3Dmozilla%3A%3Anet%3A%3ANeckoParent%3A%3AGetValidatedAppInfo&signature=%3Dmozilla%3A%3Anet%3A%3ACrashWithReason&product=Firefox&version=52.0a1&_sort=-date&_facets=signature&_facets=build_id&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-build_id
The latest affected build ID is 20161113030203 (which should be the latest released one).
Flags: needinfo?(mcastelluccio)
Assignee | ||
Comment 148•8 years ago
|
||
Interesting: in this crash-report list, there are just crashes with userContextId 5. No containers involved.
Comment 149•8 years ago
|
||
(In reply to Andrea Marchesini [:baku] from comment #148)
> Interesting: in this crash-report list, there are just crashes with
> userContextId 5. No containers involved.
What's userContextId 5?
Assignee | ||
Comment 150•8 years ago
|
||
> What's userContextId 5?
about:newtab uses userContextId in a hidden xul:browser, used for thumbnail generation.
Comment 151•8 years ago
|
||
Comment on attachment 8805792 [details] [diff] [review]
Part 1 - IsNotNull() check
We might want to take this fix in 50.1.0.
Attachment #8805792 -
Flags: approval-mozilla-release?
Attachment #8806604 -
Flags: approval-mozilla-release?
Comment 152•8 years ago
|
||
This bug is getting messy to track given the number of comments/landings/backouts and today's merges on top of that. Let's call this specific bug fixed and file a new bug to track any further work that needs to happen on trunk for any remaining problem spots. And obviously we still need to sort out the branch uplifts here.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Updated•8 years ago
|
Attachment #8809425 -
Attachment description: patch 2 for m-a → patch 2 for 51
Comment 153•8 years ago
|
||
Comment on attachment 8805792 [details] [diff] [review]
Part 1 - IsNotNull() check
Since merge day is passed and we may still need the patch in 51 beta, I remove the approval‑mozilla‑aurora+ flag and keep approval‑mozilla‑beta? for 51.
Attachment #8805792 -
Flags: approval-mozilla-aurora+
Updated•8 years ago
|
Attachment #8806604 -
Flags: approval-mozilla-aurora+
Comment 154•8 years ago
|
||
Hi baku,
Since the patch for 51 was backed out because of some assert errors, I'm still waiting for the new patches for 51 beta. Need info you to put this in your radar.
Flags: needinfo?(amarchesini)
Assignee | ||
Comment 155•8 years ago
|
||
Attachment #8809425 -
Attachment is obsolete: true
Flags: needinfo?(amarchesini)
Updated•8 years ago
|
Attachment #8811718 -
Attachment is patch: true
Comment 156•8 years ago
|
||
Comment on attachment 8806604 [details] [diff] [review]
part 2 - SW+Push
Might as well move the requests over to the patch that'd actually be going to Beta/Release if approved.
Attachment #8806604 -
Flags: approval-mozilla-release?
Attachment #8806604 -
Flags: approval-mozilla-beta?
Updated•8 years ago
|
Attachment #8811718 -
Flags: approval-mozilla-release?
Attachment #8811718 -
Flags: approval-mozilla-beta?
Comment 157•8 years ago
|
||
Hi Ryan,
Do you know Which one is for beta uplift request? both or patch 2 for 51 only?
Flags: needinfo?(ryanvm)
Comment 159•8 years ago
|
||
Comment on attachment 8805792 [details] [diff] [review]
Part 1 - IsNotNull() check
Fix a crash. Beta51+. Should be in 51 beta 2.
Attachment #8805792 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
Updated•8 years ago
|
Attachment #8811718 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment 160•8 years ago
|
||
bugherder uplift |
Comment 161•8 years ago
|
||
Follow-up bustage fix for RELEASE_OR_BETA builds.
https://hg.mozilla.org/releases/mozilla-aurora/rev/f7e648e838578f2095f606602202117f0645ed04
https://hg.mozilla.org/releases/mozilla-beta/rev/6296ed3dbefd133bba324230ec4f5a07d37041e1
Comment 162•8 years ago
|
||
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a206b81a650d
Remove the MOZ_NORETURN hint from CrashWithReason. r=valentin
Comment 163•8 years ago
|
||
bugherder |
Updated•8 years ago
|
Flags: needinfo?(ptheriault)
Comment 164•8 years ago
|
||
Comment on attachment 8805792 [details] [diff] [review]
Part 1 - IsNotNull() check
Top crasher on 50, meets the triage bar for inclusion in 50.1.0
Attachment #8805792 -
Flags: approval-mozilla-release? → approval-mozilla-release+
Attachment #8811718 -
Flags: approval-mozilla-release? → approval-mozilla-release+
Comment 165•8 years ago
|
||
bugherder uplift |
Comment 166•8 years ago
|
||
Backed out for bustage.
https://hg.mozilla.org/releases/mozilla-release/rev/5d8d14cdf400
https://treeherder.mozilla.org/logviewer.html#?job_id=436149&repo=mozilla-release
Flags: needinfo?(amarchesini)
Assignee | ||
Comment 167•8 years ago
|
||
Is it too late for submitting a new patch for release?
Flags: needinfo?(amarchesini) → needinfo?(ryanvm)
Updated•8 years ago
|
You need to log in
before you can comment on or make changes to this bug.
Description
•