Closed Bug 1328768 Opened 5 years ago Closed 4 years ago

Crash in xpcObjectHelper::xpcObjectHelper

Categories

(Toolkit :: Crash Reporting, defect)

Unspecified
Windows 10
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox-esr45 --- unaffected
firefox51 --- unaffected
firefox52 --- unaffected
firefox53 + fixed
firefox54 + fixed

People

(Reporter: marcia, Assigned: gsvelto)

References

Details

(4 keywords, Whiteboard: [pre-existing but exacerbated by e10s-multi?])

Crash Data

Attachments

(1 file, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-8d22df4b-5697-463d-b416-fa4c82170105.
=============================================================

Seen while looking at nightly crash stats: http://bit.ly/2iFVAgx. Crash started using 20170104030214 build. 

Possible regression range: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=c91249f41e3766274131a84f9157a4d9d9949520&tochange=57ac9f63fc6953f4efeb0cc84a60192d3721251f

Not sure if this is bucketed in the right component - could use some help there.
These appear to be fairly constant in crash rate back to August... and a wide set of versions.  Also a wide range of addresses, including UAFs.  Note that given the variety of stacks there are probably multiple bugs here!
Group: core-security
Flags: needinfo?(overholt)
Bobby, are you the best person to investigate?
Flags: needinfo?(overholt) → needinfo?(bobbyholley)
(In reply to Andrew Overholt [:overholt] from comment #2)
> Bobby, are you the best person to investigate?

I don't have the cycles to unfortunately. Will require some sleuthing, ideally by someone at least somewhat familiar with xpconnect code. mccr8 or njn could be good candidates if they can be persuaded.
Flags: needinfo?(bobbyholley)
These crashes are dying inside a QI. I'm not sure how you can really investigate that. Kanru, maybe somebody on your team could poke around in a crash dump a little to see if there's something useful? The crash volume looks pretty low, so I'm not sure how worthwhile it is to spend much time on looking at this. It could just be a sign of some earlier generic memory corruption or hardware failure.
Flags: needinfo?(kchen)
I looked at some minidumps with our without "shutdown" in the stacks, all of them are QI to nsIAsyncShutdownBlocker. Which means we might be using some async shutdown blocker after free.

I found one case bp-a807ce9e-4206-4265-b8d6-569ef2170108 which has mozilla::ipc::AsyncMinidumpAnalyzer::Run in the stacks.

My theory is when we successively call CrashReporterHost::AsyncAddCrash here[1] we will fail to add the blocker the second time here[2][3], then we capture the |this| variable here[4] by value but didn't increase the refcnt so when we finally want to remove the second blocker[5] it's already deleted.

I haven't be able to reproduce this with my theory though.

[1]: http://searchfox.org/mozilla-central/rev/bf98cd4315b5efa1b28831001ad27d54df7bbb68/ipc/glue/CrashReporterHost.cpp#187
[2]: http://searchfox.org/mozilla-central/rev/bf98cd4315b5efa1b28831001ad27d54df7bbb68/ipc/glue/CrashReporterHost.cpp#94
[3]: JavaScript error: [...]components/nsAsyncShutdown.js, line 123: Error: We have already registered a distinct blocker with the same name: Crash reporter: blocking on minidump analysis
[4]: http://searchfox.org/mozilla-central/rev/bf98cd4315b5efa1b28831001ad27d54df7bbb68/ipc/glue/CrashReporterHost.cpp#113
[5]: http://searchfox.org/mozilla-central/rev/bf98cd4315b5efa1b28831001ad27d54df7bbb68/ipc/glue/CrashReporterHost.cpp#123
Flags: needinfo?(kchen) → needinfo?(gsvelto)
Group: core-security → toolkit-core-security
Component: XPConnect → Breakpad Integration
Product: Core → Toolkit
(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #5)
> My theory is when we successively call CrashReporterHost::AsyncAddCrash
> here[1] we will fail to add the blocker the second time here[2][3], then we
> capture the |this| variable here[4] by value but didn't increase the refcnt
> so when we finally want to remove the second blocker[5] it's already deleted.

This sounds plausible, I hadn't thought of the use case where we'd get two content crashes in rapid succession. I'll try to cook up a patch that solves this as well as simulate the crash by manually calling the function twice before releasing the blocker. Leaving the NI for now.
The regression range in comment 0 includes enabling e10s-multi, so perhaps that causes this to happen more.
(In reply to Andrew McCreight [:mccr8] from comment #7)
> The regression range in comment 0 includes enabling e10s-multi, so perhaps
> that causes this to happen more.

If it's a race when calling CrashReporterHost::AsyncAddCrash() twice before the shutdown barrier can be removed then it's likely that e10s-multi made it a lot more likely (e.g. by having two content processes crashing at the same time).
Whiteboard: [pre-existing but exacerbated by e10s-multi?]
Flags: needinfo?(gsvelto)
Taking this, patch coming.
Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Here's my WIP for fixing this. I'm now capturing this using an explicit initializer so that the object should now remain alive within the lambda function even if it's not being held by anything else. Does this look sane to you?
Attachment #8831874 - Flags: feedback?(kchen)
BTW I couldn't reproduce this but it makes sense. nsAsyncShutdownClient.removeBlocker() tries to read the name from within the pointer that's passed to itM so if that object has already been destroyed and the allocator used its memory for something else then a crash is the most likely outcome.

My guess is that this didn't blow up on me because I never had two content processes crash at the same time. Even when t hat happens there's still a chance that everything will work fine unless the freed object has already been overwritten.
(In reply to Gabriele Svelto [:gsvelto] from comment #11)
> Even when t hat happens there's still a
> chance that everything will work fine unless the freed object has already
> been overwritten.

Our allocator overwrites most objects with a poison value immediately when it frees them.
(In reply to Andrew McCreight [:mccr8] from comment #12)
> Our allocator overwrites most objects with a poison value immediately when
> it frees them.

Is the entire object overwritten or just the header (vtable, etc...)? If it is then we should reliably crash but then the race might be too narrow to reproduce it easily. I couldn't reproduce it locally but it's still the most likely explanation - if addBlocker() fails then this line is touching a dead object and should crash:

https://dxr.mozilla.org/mozilla-central/rev/1fe66bd0efba89df59d2046e8c91418eb5ae10b8/toolkit/components/asyncshutdown/nsAsyncShutdown.js#104
We overwrite the entire allocation
Comment on attachment 8831874 [details] [diff] [review]
[PATCH WIP] Always keep the minidump analyzer off-main-thread runnable object alive until we remove the shutdown blocker

Review of attachment 8831874 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/glue/CrashReporterHost.cpp
@@ +110,5 @@
> +      self = Move(self),
> +      processType = mProcessType,
> +      crashType = mCrashType,
> +      childDumpID = mChildDumpID
> +    ] {

This looks OK. Although according to https://developer.mozilla.org/en-US/docs/Using_CXX_in_Mozilla_code we can't use initialized lambda captures in our code yet. I guess you can list the used variables directly, no need to use the alias name.
Attachment #8831874 - Flags: feedback?(kchen) → feedback+
(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #15)
> This looks OK. Although according to
> https://developer.mozilla.org/en-US/docs/Using_CXX_in_Mozilla_code we can't
> use initialized lambda captures in our code yet. I guess you can list the
> used variables directly, no need to use the alias name.

Thanks, I've noticed another instance of an initialized lambda captures in the code:

https://dxr.mozilla.org/mozilla-central/rev/1d025ac534a6333a8170a59a95a8a3673d4028ee/netwerk/socket/nsNamedPipeService.cpp#45

Maybe the doc wasn't updated yet? Either way if I can do without it I'll do.
Asking Benjamin to review this since he also reviewed the patch that introduced this code. The try run is here:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=910a6fbc0ea5a3d21c875c1cdcb4054dbfde55ca
Attachment #8831874 - Attachment is obsolete: true
Attachment #8833348 - Flags: review?(benjamin)
Comment on attachment 8833348 [details] [diff] [review]
[PATCH] Always keep the minidump analyzer runnable object alive until we remove the shutdown blocker

I must admit I had to look up the C++ syntax for lambdas the first time I reviewed this, and I don't understand how non-POD locals entrained via lambdas work. So I'm going to bounce this to somebody who hopefully does.
Attachment #8833348 - Flags: review?(benjamin) → review?(continuation)
Attachment #8833348 - Flags: review?(continuation) → review?(nfroyd)
Comment on attachment 8833348 [details] [diff] [review]
[PATCH] Always keep the minidump analyzer runnable object alive until we remove the shutdown blocker

Review of attachment 8833348 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!  We should have caught this in the lambda capture static analysis; I filed bug 1336510 on that.
Attachment #8833348 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd [:froydnj] from comment #19)
> Thanks!  We should have caught this in the lambda capture static analysis; I
> filed bug 1336510 on that.

Thanks for the review; it would be excellent if we covered this with static analysis because it's surprisingly easy to shoot yourself in the foot with this. It's the kind of mistake that you'll easily keep in mind after you've done it once but which is pretty easy to do since we don't have much code yet mixing ref-counted objects with lambda captures.

The try run is here, all green save for the usual intermittents which I've re-triggered:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=910a6fbc0ea5a3d21c875c1cdcb4054dbfde55ca
Comment on attachment 8833348 [details] [diff] [review]
[PATCH] Always keep the minidump analyzer runnable object alive until we remove the shutdown blocker

[Security approval request comment]
How easily could an exploit be constructed based on the patch?

The problem is caused in a race in the chain of methods and runnables called when a content process dies to analyze the crash. It's almost impossible to trigger the patch if e10s-multi is disabled (the content process needs to crash twice in a very short amount of time); when multi-e10s is enabled then two pages living in two different content processes need to crash the content processes almost at the same time.

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?

I've tried to keep the comment as neutral as possible, it describes the process without mentioning that one can access freed memory via the dead object this pointer nor the conditions to trigger it.

Which older supported branches are affected by this flaw?

The patch that introduced the problem has also landed on aurora:

https://hg.mozilla.org/releases/mozilla-aurora/log?rev=a848fea61a8f

If not all supported branches, which bug introduced the flaw?

The flaw was introduced in bug 1293656.

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?

The patch on this bug applies cleanly to mozilla-aurora too.

How likely is this patch to cause regressions; how much testing does it need?

The patch is higly unlikely to cause regressions. The change only adds and uses an additional variable to keep an object alive. I've tested it manually on Linux, Windows and Mac.
Attachment #8833348 - Flags: sec-approval?
Maybe we should file a follow-up bug for the crashes that predate bug 1293656?
Comment on attachment 8833348 [details] [diff] [review]
[PATCH] Always keep the minidump analyzer runnable object alive until we remove the shutdown blocker

sec-approval+ for trunk. Please nominate a patch for Aurora as well so we can avoid shipping the issue.
Attachment #8833348 - Flags: sec-approval? → sec-approval+
(In reply to Al Billings [:abillings] from comment #23)
> sec-approval+ for trunk. Please nominate a patch for Aurora as well so we
> can avoid shipping the issue.

Since it's the first time I handle a security bug I've got a silly question: should I wait for the patch to merged into mozilla-central, then graft it on top of aurora and ask approval for that, or should I just attach another patch that applies cleanly to aurora?
Flags: needinfo?(abillings)
No need to attach a branch patch unless rebasing is required. Otherwise, the commit that landed on m-c will be grafted onto Aurora once approval is granted (noting that sec-approval is *not* the same as uplift approval). For now, just get the patch landed on inbound :)
Flags: needinfo?(abillings)
Oh, I see, I thought I'd have to ask for sec-approval for the aurora patch, nevermind. I've pushed this one to inbound:

https://hg.mozilla.org/integration/mozilla-inbound/rev/7b1c47953224fa956a138915fc998efcf6ac5226
If the existing patch applies cleanly, just nominate it for aurora and it will be approved there. Otherwise, you should attach an aurora specific patch and nominate it.
https://hg.mozilla.org/mozilla-central/rev/7b1c47953224
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Comment on attachment 8833348 [details] [diff] [review]
[PATCH] Always keep the minidump analyzer runnable object alive until we remove the shutdown blocker

Approval Request Comment
[Feature/Bug causing the regression]: bug 1293656
[User impact if declined]: potential use-after-free when the affected code hits a race caused by two almost simultaneous crashes
[Is this code covered by automated tests?]: No
[Has the fix been verified in Nightly?]: I've manually tested it, reproducing the race is complex without deliberately widening it
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: No
[Why is the change risky/not risky?]: It's a very small change that only ensures that an object remains alive while the runnable that touches it is still pending
[String changes made/needed]: None
Attachment #8833348 - Flags: approval-mozilla-aurora?
Comment on attachment 8833348 [details] [diff] [review]
[PATCH] Always keep the minidump analyzer runnable object alive until we remove the shutdown blocker

Sec-high, crash fix, let's take this for aurora. Thanks Gabriele.
Attachment #8833348 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Group: toolkit-core-security → core-security-release
Keywords: regression
I'm marking this as qe-verify- as there's not much QA can do here without STR to reliably trigger the crash. Taking a quick look at the crashes, it looks like these changes have fixed the problem. If there's a reliable way to reproduce this crash, please needinfo me and I'll take another look.
Flags: qe-verify-
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.