Closed Bug 988878 Opened 6 years ago Closed 6 years ago

Maverick crash (?) on latest nightly with this testcase

Categories

(Core :: Graphics, defect)

x86
macOS
defect
Not set

Tracking

()

VERIFIED FIXED
mozilla31
Tracking Status
firefox29 --- unaffected
firefox30 + fixed
firefox31 + fixed
firefox-esr24 --- unaffected

People

(Reporter: martijn.martijn, Assigned: bent.mozilla)

References

Details

(5 keywords)

Attachments

(3 files)

Attached file crash1.htm
See testcase:
<html>
<head>
</head>
<body style="-moz-column-count: 1;transform: scale(99);" onload="this.removeAttribute('style');">
<embed type="application/x-shockwave-flash"/>
<button style="position: fixed; top: 999999999px;">m</button>
</body>
</html>

With this testcase, my 10.9.2 MacOSX OS seems to crash on the latest nightly build. It doesn't seem to crash with Firefox29beta.
Crash report ID or stack?
Flags: needinfo?(martijn.martijn)
I don't get anything, the whole OS seems to lock or every application seems to close and the only thing I'm allowed to do is a restart.
Flags: needinfo?(martijn.martijn)
Same here. I'm not sure it's a security bug given that it takes down the whole machine, but let me see if I can catch anything in the debug build.
Running under debugger doesn't stop it.  Not clear what state the computer is in at that point. The display is just white, can't switch to other applications (or see that I've switched.)  If I sleep and weak up, I will get the login prompt, but after logging in, the screen is still white. If I'm remotely connected to the computer at the time we hit this bug, I get disconnected.  Connecting again, firefox process is gone. Killing the loginwindow process logs me out and then things work properly.
Turning off OMTC stops the crash. Switching to skia from cg doesn't make a difference. With OMTC on, forcing basic also doesn't make a difference.
Last thing that seems to happen before we crash is:

IPDL protocol error: unknown union type
###!!! ASSERTION: IPDL error [PLayerTransactionParent]: "unknown union type". Killing child side as a result.: 'Error', file /Users/msreckovic/Repos/mozilla-central/ipc/glue/ProtocolUtils.cpp, line 190
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(bent.mozilla)
Complete shot in the dark - we send the message with scale, attribute gets removed, we end up in getting an invalid type? (and call abort).  The whole "kicked out of ssh" is interesting...
Integrated or discrete GPU, makes no difference.
Come to think of it, "kicked out of ssh" is probably tty going away?
Process id in  ProtocolUtils.FatalError is -1. That means kill all processes owned by the user, or perhaps all processes (if superuser), right? That seems... dangerous? Or am I misinterpreting the meaning of pid of -1?
pid -1 I think means 'cross thread actor' rather than 'cross process actor'.
Flags: needinfo?(bent.mozilla)
Hmm.  Right now if we call KillProcess(pid=-1,...), it calls system kill(-1, SIGTERM), then kill(-1,SIGKILL) if necessary. Which is supposed to terminate/kill all user processes.
Flags: needinfo?(nical.bugzilla)
Yikes. I guess we were just calling kill with a random pid before bug 974356, and now we're taking down the whole system. Amazing.

I think we would have caught this sooner if chromium's pid type and processhandle types were distinct...
Blocks: 974356
Attached patch Patch, v1Splinter Review
Assignee: nobody → bent.mozilla
Status: NEW → ASSIGNED
Attachment #8398882 - Flags: review?(benjamin)
(In reply to ben turner [:bent] (use the needinfo? flag!) from comment #15)
> Created attachment 8398882 [details] [diff] [review]
> Patch, v1

Is the right thing here to not do anything at all (this case) or actually abort the whole thing? Your change just ignores the FatalError call - nothing gets killed at all - is that the right thing?
When we do avoid aborting in the FatalError code, I can get the intermittent crash here (usually if loading the page from the command line):
[28044] ###!!! ABORT: Aborting on channel error.: file /Users/msreckovic/Repos/mozilla-central/ipc/glue/MessageChannel.cpp, line 1522
so it may actually be better to crash when we hit this FatalError.
FatalError is meant to be called when we need to kill a child process. For cross-thread actors there's nothing to kill since we implicitly trust our own process. It's just a programmer bug that causes this problem, not a hacked or misbehaving child process. How we handle that is a bit debatable I guess.

I think in theory we should be able to maintain the same model for both cross-process and cross-thread actors where we tear down the channel and send ActorDestroy messages to the child and handle everything cleanly. It might be tough though, and aborting if we get it wrong is probably the only option.

Admittedly error handling in the cross-thread actor code has probably been tested much less than the cross-process case. I would expect you to reliably hit the abort in comment 17, not just sometimes.
:bsmedberg, (admittedly overeager) review ping :)

:bent, the bug 974356 is in Aurora/30 right now, so we will want an uplift; is the patch the same?
Flags: needinfo?(benjamin)
(In reply to Milan Sreckovic [:milan] (TPE timezone) from comment #19)
> :bent, the bug 974356 is in Aurora/30 right now, so we will want an uplift;
> is the patch the same?

Yeah, assuming this is what we want to do here.

We should really get someone working on comment 6 though!
Yeah, comment 6 is probably a non-security and a separate bug(?), but, I agree we should look at it and I've been doing that a bit.
This sounds like a pretty bad DoS if someone could force this so obviously we need to fix it, but doesn't sound like a severe "security" bug.
Attachment #8398882 - Flags: review?(benjamin) → review+
Flags: needinfo?(benjamin)
Comment on attachment 8398882 [details] [diff] [review]
Patch, v1

[Security approval request comment]
How easily could an exploit be constructed based on the patch?
Very difficult imo.

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?
No.

Which older supported branches are affected by this flaw?
See flags.

If not all supported branches, which bug introduced the flaw?
Bug 974356 exposed this, but the faulty code has been here essentially ever since we started doing off-main-thread-compositing.

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?
Same patch.

How likely is this patch to cause regressions; how much testing does it need?
This shouldn't affect anything except preventing us from taking down all of OS X :)
Attachment #8398882 - Flags: sec-approval?
Comment on attachment 8398882 [details] [diff] [review]
Patch, v1

Sec-approval+ for trunk. Let's get branch patches created and nominated too.
Attachment #8398882 - Flags: sec-approval? → sec-approval+
Comment on attachment 8398882 [details] [diff] [review]
Patch, v1

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Bug 974356
User impact if declined: Killing all apps on OS X when we have an OMTC layers bug
Testing completed (on m-c, etc.): on m-i now
Risk to taking this patch (and alternatives if risky): This should be very safe
String or IDL/UUID changes made by this patch: None
Attachment #8398882 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/8b908fdf7bc7
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Ok, this seems to be fixed. But when I load the testcase in the latest Nightly build, it causes Firefox to not repaint anything, so when I switch from one tab to the tab with that page loaded, the previous tab is still shown. I filed bug 992999 for it.
Status: RESOLVED → VERIFIED
Depends on: 992999
Attachment #8398882 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Group: core-security
Flags: in-testsuite?
You need to log in before you can comment on or make changes to this bug.