Closed Bug 1553742 Opened 3 years ago Closed 3 years ago

Crash in [@ mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::ipc::IPDLParamTraits<T>::Write]

Categories

(Core :: DOM: Security, defect, P1)

68 Branch
Unspecified
Windows 10
defect

Tracking

()

VERIFIED FIXED
mozilla69
Tracking Status
firefox-esr60 --- unaffected
firefox67 --- unaffected
firefox67.0.1 --- unaffected
firefox68 --- unaffected
firefox69 + verified

People

(Reporter: calixte, Assigned: ckerschb)

References

(Blocks 2 open bugs, Regression)

Details

(Keywords: crash, regression, topcrash, Whiteboard: [domsecurity-active])

Crash Data

Attachments

(2 files)

This bug is for crash report bp-8b9c1734-8389-4fa0-b212-4c2df0190522.

Top 10 frames of crashing thread:

0 xul.dll mozilla::ipc::FatalError ipc/glue/ProtocolUtils.cpp:262
1 xul.dll mozilla::ipc::IProtocol::HandleFatalError ipc/glue/ProtocolUtils.cpp:473
2 xul.dll mozilla::ipc::IPDLParamTraits<mozilla::ipc::PrincipalInfo>::Write ipc/ipdl/PBackgroundSharedTypes.cpp:860
3 xul.dll mozilla::ipc::IPDLParamTraits<mozilla::ipc::CSPInfo>::Write ipc/ipdl/PBackgroundSharedTypes.cpp:1078
4 xul.dll mozilla::ipc::IPDLParamTraits<mozilla::Maybe<mozilla::ipc::CSPInfo> >::Write ipc/glue/IPDLParamTraits.h:231
5 xul.dll mozilla::ipc::IPDLParamTraits<mozilla::dom::IPCClientInfo>::Write ipc/ipdl/ClientIPCTypes.cpp:178
6 xul.dll mozilla::ipc::IPDLParamTraits<mozilla::Maybe<mozilla::dom::IPCClientInfo> >::Write ipc/glue/IPDLParamTraits.h:231
7 xul.dll mozilla::ipc::IPDLParamTraits<mozilla::net::LoadInfoArgs>::Write ipc/ipdl/NeckoChannelParams.cpp:558
8 xul.dll mozilla::ipc::IPDLParamTraits<mozilla::Maybe<mozilla::net::LoadInfoArgs> >::Write ipc/glue/IPDLParamTraits.h:231
9 xul.dll static void mozilla::ipc::IPDLParamTraits<mozilla::net::HttpChannelOpenArgs>::Write ipc/ipdl/NeckoChannelParams.cpp:1307

There are 65 crashes (from 15 installations) in nightly 69 starting with buildid 20190522093426. In analyzing the backtrace, the regression may have been introduced by patch [1] to fix bug 1547218.
The MOZ_CRASH_REASON is always: IPDL error: "unknown union type". abort()ing as a result.

[1] https://hg.mozilla.org/mozilla-central/rev?node=15b4b7f9d97a

Flags: needinfo?(nika)

This is the set of commits added in Nightly 20190522093426: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=257f2c96cef502a1d674df56c8e39d76d8ed4d89&tochange=5f95b3f2ea44723ba6a8c41a4b27c88032df709f

Bug 965637, a huge change to how we deal with CSP, including serialization and deserialization, seems like a much more likely regressor.

Component: IPC → DOM: Security
Flags: needinfo?(nika) → needinfo?(ckerschb)
Regressed by: 965637
No longer regressed by: 1547218

This is the #1 non-shutdown hang crash on OSX for the May 22 Nightlies, with 14 crashes from 3 installations.

:jkt, can you please take a look while I am out. As soon as I am back I'll take over of course. Please also hold back with landing you Principal serialization patches till we have sorted this problem out. Thanks!

Flags: needinfo?(jkt)

Hmm, not sure how my changes would be impacting this, but here's an explanation as to what's going on:

So unfortunately, trying to send a T__None (which is the default) union type over IPC causes this crash. This is caused whenever an IPDL union has been overridden. It looks like the cause here is that the PrincipalInfo object was default-initialized when creating the CSPInfo, and not overridden with a different value.

From some poking around, I found at least one place where this can happen. In the method CSPToCSPInfo if aCSP is nullptr, the CSPInfo will not be explicitly initialized, and instead will be left in its default-initialized state. This leaves the PrincipalInfo member in an invalid state, and can trigger this crash. https://searchfox.org/mozilla-central/rev/662de518b1686c4769320d6b8825ce4864c4eda0/ipc/glue/BackgroundUtils.cpp#213-215

I'm not entirely sure how we would've gotten in that particular situation, but it seems like https://searchfox.org/mozilla-central/rev/662de518b1686c4769320d6b8825ce4864c4eda0/dom/workers/WorkerLoadInfo.cpp#117 is the main call site where we don't null-check the nsIContentSecurityPolicy argument.

Sending a ni? to :ckerschb because I think he's touched this code before.

--

Just a little note here to explain why this weird T__None situation exists - basically in order for ParamTraits to work properly, we need to have a default constructor so that the target outparameter can be initialized & passed in to ParamTraits::Read. The crash when sending that default state over IPC was made because you're theoretically not supposed to do this.

It may be possible/reasonable to allow that value to be serialized, but I have no strong opinions there.

It looks like this is actually the top crash across all platforms.

[Tracking Requested - why for this release]: top crash

I tried loading a few URLs from the crash reports and then reloading the browser, but that didn't trigger the crash.

Keywords: topcrash

It looks to me like the code should have been returning a failure, checking with try if it causes no problems: https://treeherder.mozilla.org/#/jobs?repo=try&revision=48fbf9680cb0352e7a09b2ba170204f8e0fb45fb

I haven't been able to replicate the crash but based on Comment 4 this looks like the issue and if it doens't cause any other issues with tests I think it's worth trying to land to see if it resolves the issue.

Assignee: nobody → jkt
Status: NEW → ASSIGNED
Flags: needinfo?(jkt)

(In reply to Andrew McCreight [:mccr8] from comment #5)

I tried loading a few URLs from the crash reports and then reloading the browser, but that didn't trigger the crash.

This is a really annoying crash. For a bit more context, this consistently crashes on my Nightly on latest Arch Linux x86-64 (Gnome with Wayland), on some pages. Most prominent are web.telegram.org and Google Spreadsheets. FWIW both sites load halfway through and crash just before becoming interactive (there is a flash of page content for half a second before the crash hits).

My Nightly is my day-to-day browser, but is fairly light on addons. I do run Webrender as an opt-in. I'm happy to email someone more info or test builds to help get this fixed, feel free to email me at flaki@moco or ping me on Slack.

I have a patch that I just sent out for testing, if you could try the build on there that likely would be helpful: https://treeherder.mozilla.org/#/jobs?repo=try&revision=65363b4d97e6235f020e200777bc3a4706d85781 (target file) I have no idea how broken this will be for our test suite yet.

I no longer instantiate the CSPInfo anymore on the callsites and instead use Maybe<CSPInfo> to pass an owned return. I also noticed WorkerPrivate::StoreCSPOnClient() wasn't checking if the CSPInfo existed, this might have been the cause of the other perma fail regressions of the bug 965637 that caused this.

Due to the crash handle being so generic the stats aren't reliable for this bug. I have also struggled to replicate the crash, however I have been working on the assumption that Nika posted in Comment 4 as there aren't other call sites that create CSPInfo.

Flags: needinfo?(bugzilla)

Ignore that :flaki this is still crashing, however I think I have narrowed down the error to one place. Your examples make sense as they likely use workers.

Flags: needinfo?(bugzilla)

Please note that I can always replicate this crash on https://www.getdigital.de/Matias-Tastatur-Tactile-Pro-FK302-DE.html. The tab's content crashes as soon as the site has finished loading. Maybe it will be helpful for verification.

Flags: needinfo?(ckerschb)
Priority: -- → P1
Whiteboard: [domsecurity-active]

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #11)

Please note that I can always replicate this crash on https://www.getdigital.de/Matias-Tastatur-Tactile-Pro-FK302-DE.html. The tab's content crashes as soon as the site has finished loading. Maybe it will be helpful for verification.

:whimboo, I wasn't able to reproduce the crash using https://www.getdigital.de/Matias-Tastatur-Tactile-Pro-FK302-DE.html. Obviously a testcase would be super helpful to get this problem sorted is. Question is, can you reproduce using a fresh profile as well? Anything else you can think of so I can reproduce the problem? Thanks for your help!

Flags: needinfo?(hskupin)

The preference necessary to trigger the crash for me is security.csp.enable. You will have to set it to false.

Flags: needinfo?(hskupin)
Blocks: 1555292

Finally I got a chance to take a look at what's going on here. As :nika already indicated within comment 4 we are creating a CSPInfo which then does not get initialized correctly because aCSP is null here [1]. This problem occurs only for worker contexts because we do the correct null checks on the callsites for all other contexts.
I was able to reproduce the problem by

The problem is indeed that the CSP was null here [1] which caused the CSPInfo to end up in a half initialized state.

Here is what I suggest:
a) We do the proper null checks in worker contexts before creating a CSPInfo within this bug, and
b) We refactor CSPToCSPInfo to not use an outgoing argument but to return a CSPInfo, because currently we might still end up with a half way initialized cspInfo in case something is bogus with the principal [1].

However, I assume that (a) already fixes the vast majority of problems.

[1] https://hg.mozilla.org/mozilla-central/rev/ddf4012a7652#l60.103

Assignee: jkt → ckerschb
Pushed by mozilla@christophkerschbaumer.com:
https://hg.mozilla.org/integration/autoland/rev/b31e0897ef74
Null Check CSP before calling CSPToCSPInfo to avoid half initialized state of cspinfo. r=baku,mccr8
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla69
See Also: → 1555120
Flags: qe-verify+

Confirmed issue with 69.0a1 (2019-05-23) on Windows 10.
Fix verified with 69.0b11 on Windows 10, macOS 10.13, Ubutnu 16.04.
A brief navigation on the mentioned website revealed no other issues/crashes as well.

Status: RESOLVED → VERIFIED
Flags: qe-verify+
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.