Closed Bug 780770 Opened 8 years ago Closed 8 years ago
Content can prevent chrome buttons from working
STR: 1) Load https://metrics.mozilla.com/data/content/pentaho-cdf-dd/Render?solution=metrics2&path=%2Ftelemetry&file=telemetryHistogram.wcdf 2) Login using Mozilla Persona with any email address. 3) While the loading spinners are shown for a histogram, try to click the new tab button (may need to have tab overflow or customize the toolbar for it to be shown). Alternatively, try clicking buttons on the error console while the in-content throbbers are shown. Some histograms take longer to load than others and make it easier to reproduce. Actual results: Browser is still responsive but various UI buttons don't work (e.g. new tab button or error console buttons). The buttons show the depressed UI but their actions are not taken (and don't seem to be queued). Closing the tab still works which puts the browser back to a usable state. Expected results: UI buttons should work like they normally do on other pages. Last good nightly: 2012-01-15 First bad nightly: 2012-01-16 Pushlog: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=21c84409902e&tochange=047c8ba7d2e4 Marking as a security bug for now since security UI is also affected such as the "Get me out of here" button on the reported attack site notification bar. I haven't yet figured out what the page is doing or made a reduced test-case. Reproduced on OS X 10.7 and Windows 7. I'm going to narrow down the regression range to the specific commit now.
Regression caused by bug 716167 changeset a0e3fb36fb1a: changeset: 84460:a0e3fb36fb1a user: Bobby Holley <email@example.com> date: Sat Jan 14 10:31:16 2012 -0800 summary: Bug 716167 - Only push null contexts in XPConnect for main thread events, and remove infrastructure from bug 326777. r=bz
On the surface looks like a Denial of Service vuln, but given the regressing bug might be worse. Need Bobby to take a look.
Component: General → XPConnect
I just poked at this a little bit. I see a couple of interesting things in the debug spew: WARNING: NS_ENSURE_SUCCESS(rv, false) failed with result 0x8000FFFF: file /files/mozilla/repos/i/content/base/src/nsContentUtils.cpp, line 2929 WARNING: NS_ENSURE_TRUE(pusher.Push(aBoundElement)) failed: file /files/mozilla/repos/i/content/xbl/src/nsXBLProtoImplMethod.cpp, line 329 but they don't seem to be reliable or consistent. This stuff isn't really my area of expertise, and my spidey senses tell me I'm probably not the best person to investigate it. Probably better to have someone with more front-end expertise figure out why exactly the button isn't working.
I took a look again yesterday with GDB and it seems like sync XHR is how it's triggered with pentaho but Felipe hit this today with the Java plugin on the stack instead. (with nsAppShell::ProcessNextNativeEvent followed by the HandleEvent calls). Something interesting to me (for someone who doesn't know this code) was nsEventStatus_eIgnore but perhaps that is normal? (gdb) bt 1 #0 HandleEvent (aEvent=0x7fff5fbf6350) at /Users/matthew/mozilla-central/view/src/nsView.cpp:130 (gdb) p result $31 = nsEventStatus_eIgnore
I just hit this bug again, and the stack involved a plugin running its message handler and spinning the event loop. Matthew was able to reproduce with his STR and the stack involved does not have plugins, but has synchronous XHR which also spins the event loop. So it seems that is a good lead. It looks like some kind of situation ends up not being able to unwind and decrease the event loop depth. Here is a profile I got after the browser was inside this situation: http://people.mozilla.com/~bgirard/cleopatra/?report=7f0a49d5aec2f78a9235c31877dacac02a3841df (choose `Invert callstack` and then keep expanding the nsDOMEvent::GetTargetForFrame stack) The stack is not 100% meaningful because the browser was not hanging, and nsHTMLButtonElement::PostHandleEvent is (from my best guess) me clicking on the profiler's "Start/Stop/Analyze" buttons. But it seems to imply that the whole browser is running inside the nested loop. FWIW in my case, closing all tabs and killing plugin-container did not resolve the situation. Firefox was also unable to quit by itself (the window disappeared but the process never went way) so I had to kill it. There were also some ghost compartments left from tabs I closed. But all of this is probably just symptons of the bug (instead of causes).
Regression in January and no movement in two months. Can we propose a security rating and get this assigned to someone to fix?
Yeah, if content can cause us to get stuck inside a nested event loop, that's not super-great. On the surface it could be used to DoS, but other things might go wrong to, so it's really hard to say. I'm going to propose sec-moderate, but I think we should actively get someone working on it.
I've seen this too. Could we back out Bug 716167?
Bug 716167 makes use call push and pop in an odd way. First we push and run then other thread observers, run the runnable, then we pop and run then other thread observers. Before Bug 716167 we always first pushed null, then run thread observers, run the runnable run thread observers and then pop.
(In reply to Olli Pettay [:smaug] from comment #9) > Bug 716167 makes use call push and pop in an odd way. s/use/us/
(In reply to Olli Pettay [:smaug] from comment #8) > I've seen this too. Could we back out Bug 716167? Nope. XPCPerThreadData is long gone (bug 755255), so touching XPConnect off-main-thread definitely won't work anymore. Also, bz said the old behavior (doing this stuff off-main-thread) was a major bug that he never intended. If there's something wrong with the way the new infrastructure pushes/pops, we should just fix that.
There is a major problem with push/pop, as far as I see. No one can rely on right context (null) to be on the stack. I think that can easily lead to sg;crit problems.
(In reply to Bobby Holley (:bholley) from comment #11) > (In reply to Olli Pettay [:smaug] from comment #8) > > I've seen this too. Could we back out Bug 716167? > > Nope. XPCPerThreadData is long gone (bug 755255), so touching XPConnect > off-main-thread definitely won't work anymore. Also, bz said the old > behavior (doing this stuff off-main-thread) was a major bug that he never > intended. Bug 716167 is not really about touching non-main thread objects. It is about pushing null context, especially about pushing null context in main thread.
(In reply to Olli Pettay [:smaug] from comment #12) > There is a major problem with push/pop, as far as I see. > No one can rely on right context (null) to be on the stack. > I think that can easily lead to sg;crit problems. I totally support fixing this however necessary. I'm just saying that it will have to be a new patch, because backing out bug 716167 isn't an option. (In reply to Olli Pettay [:smaug] from comment #13) > Bug 716167 is not really about touching non-main thread objects. > It is about pushing null context, especially about pushing null context in > main thread. It is. There are two patches in that bug, the first (what you're referring to) being a prerequisite for the second (which is the threading part). Part 1 of that bag removed code that caused us to touch XPConnect off the main thread, which now results in a MOZ_CRASH. This is why we can't just back that patch out.
So the problem we're solving is that thread observers other than nsXPConnect might get called with a non-null cx on the stack? And that causes problems for some of them?
Comment on attachment 673840 [details] [diff] [review] patch r=me
Attachment #673840 - Flags: review?(bzbarsky) → review+
(In reply to Boris Zbarsky (:bz) from comment #16) > So the problem we're solving is that thread observers other than nsXPConnect > might get called with a non-null cx on the stack? And that causes problems > for some of them? Yes. I could debug this some more if you want to know exactly which observer is causing the problem. But the patch would be the same. There is now an obvious bug in cx pushing/popping.
Nah, no need to debug to figure out the relevant observer. Just a bit mystified as to what observers we have around at all... ;)
Comment on attachment 673840 [details] [diff] [review] patch Looks good. Thanks for taking care of this. r=bholley Were you able to reproduce this bug? Did this fix it?
Attachment #673840 - Flags: review?(bobbyholley+bmo) → review+
Oh, also - can you add a comment explaining why we're doing this?
(In reply to Bobby Holley (:bholley) from comment #20) > Were you able to reproduce this bug? Did this fix it? Yes, this happens all the time with https://metrics.mozilla.com/data/ And no, I don't want to add any comments before we have this released.
Comment on attachment 673840 [details] [diff] [review] patch [Security approval request comment] How easily can the security issue be deduced from the patch? I'm not even 100% sure there is real security bug. We end up calling stuff with content cx on stack so things which should work, don't. Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem? No Which older supported branches are affected by this flaw? Not ESR10 If not all supported branches, which bug introduced the flaw? Bug 716167 Do you have backports for the affected branches? The patch applies (with some fuzz) to all the branches. How likely is this patch to cause regressions; how much testing does it need? The patch brings back the old behavior which was implemented in Bug 326777
Attachment #673840 - Flags: sec-approval?
Comment on attachment 673840 [details] [diff] [review] patch This appears safe enough to check in right now. We should make patches for applicable branches and nominate them.
Attachment #673840 - Flags: sec-approval? → sec-approval+
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment on attachment 673840 [details] [diff] [review] patch [Approval Request Comment] Bug caused by (feature/regressing bug #): bug 716167 User impact if declined: content page may block chrome UI. Possibly security bugs Testing completed (on m-c, etc.): landed m-c Risk to taking this patch (and alternatives if risky): The patch effectively backs out bug 716167 String or UUID changes made by this patch: NA
Comment on attachment 673840 [details] [diff] [review] patch [Triage Comment] Let's get this on Aurora given where we are in the cycle, but no need to uplift to Beta since if it is exploited, the user can just stop going to the affected website to avoid DoS. Please re-nominate for approval-mozilla-beta if we got that wrong.
I've been trying to verify the fix for this bug. For the broken build, I've used trunk builds from 2012-01-16 to 2012-08-06, on Mac/Win/Linux. I don't see the behavior described here; performance is acceptable and I'm able to use the UI, spawn new tabs, etc., while a histogram is being built. I suppose something could have changed on the web site itself. I have no way of knowing. I'd love advice from Olli or anyone else w/r/t reproducing this. Also, I'd like to ask the original bug filer - and anyone else who can repro - if they'd like to go back and verify that a) the bug still repros on known bad builds/configs, and b) the latest builds no longer hang. Thank you.
It is possible that the site has changed. But I don't see much reason to verify this one. The patch fixed an obvious problem.
I verified the fix using my STR from comment 0 shortly after it landed. I would guess the page has since switched to async XHR which may not cause the problem.  http://pedroalves-bi.blogspot.com/2012/11/making-cdf-calls-asynchronous.html
Status: RESOLVED → VERIFIED
(In reply to Matthew N. [:MattN] from comment #31) > I verified the fix using my STR from comment 0 shortly after it landed. I > would guess the page has since switched to async XHR which may not cause > the problem. Does that mean this is not verifiable in Firefox 18 and 19?
Whiteboard: [adv-main18-] → [adv-main18-][qa?]
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #32) > (In reply to Matthew N. [:MattN] from comment #31) > > I verified the fix using my STR from comment 0 shortly after it landed. I > > would guess the page has since switched to async XHR which may not cause > > the problem. > > Does that mean this is not verifiable in Firefox 18 and 19? It is verifiable if someone makes or finds a new testcase but I don't know the exact requirements for it. FWIW, I did verify when 19 was on m-c. Olli seems to be fine without further verification so this is probably fine as-is.
Target Milestone: --- → mozilla19
Okay, thanks Matthew. I'm marking this verified for Firefox 19 based on your testing.
You need to log in before you can comment on or make changes to this bug.