Closed Bug 708825 Opened 13 years ago Closed 13 years ago

"ASSERTION: Fault in cycle collector: traversed refs exceed refcount" closing window nearScriptStackLimit

Categories

(Core :: XPCOM, defect)

defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla12
Tracking Status
firefox8 - wontfix
firefox9 - wontfix
firefox10 - wontfix
firefox11 - wontfix
firefox12 + verified
firefox13 + verified
firefox14 + verified
firefox-esr10 12+ verified

People

(Reporter: jruderman, Assigned: mccr8)

Details

(Keywords: assertion, testcase, Whiteboard: [sg:critical][qa!])

Attachments

(3 files)

Attached file testcase —
1. Load the testcase 2. Click the button 3. Close the testcase 4. Force a CC (e.g. by quitting Firefox or using about:memory) Result: ###!!! ASSERTION: Fault in cycle collector: traversed refs exceed refcount (ptr: 127319dc0) : 'Not Reached', file /Users/jruderman/trees/mozilla-central/xpcom/base/nsCycleCollector.cpp, line 1235 In addition, everything leaks. I can reproduce with both Tinderbox debug builds and a local debug build (Mac OS X 10.6, 64-bit Firefox). Some variants crash instead, in exploitable-looking ways.
Assignee: nobody → continuation
When the cycle collector hits a fault like this, it turns itself off, which would explain the leaks.
The easiest way for me to reproduce this was to load the test case, click the button, then just exit the browser. The first shutdown CC will hit the assertion, I believe. The node in question is an <html> element. It has a refcount of 3, with 4 references to it. Those references are its 3 children, plus the document it is in. The children are a <head>, <body> and a nsGenericDOMDataNode. The nsGenericDOMDataNode seems like it has the most possibility for being what is wrong, but I'm not sure. The DOM isn't traversed until it isn't part of a currently displayed window, which is probably why you have to close the testcase before the assertion triggers.
I'm curious which <html> element. The testcase, about:blank, or data:text/html,1 ?
The test case. It looks like there's no data:test/html document by the time the CC fires off, but I could be wrong.
Oh, that explains why I had to leave in the line void ((new XMLSerializer).serializeToString(document.documentElement));
Is this a recent regression or has it been around for a while (e.g. does it affect Fx9 or earlier)?
I'll start looking at this again. Though I worry the recent CC changes will make the test case succeed, but the problem still remains. I can always go back to a slightly older build. I'm curious about what is causing it, but I wonder if in general we should just quit the browser if we hit this fault. It seems to me that if you have more references to something than the refcount thinks there are, that's just a dangling pointer waiting to happen. This is the first time I've ever heard of this happening, so hopefully it isn't a common problem. And by turning them into crashes, at least we'd have an idea if it was.
John glanced at the serializer code, and he said it looked like it was manually patching up some ref count related stuff, and had some scary comments, so that sounds promising.
Oh right this. I'll look at it after bug 723971.
I can't reproduce this on a recent Nightly debug build. I even tried making all CCs WANT_ALL_TRACES which should disable most of the recent CC optimizations, but that didn't seem to matter. On a debug Aurora build, it crashes immediately: Assertion failure: !cx->isExceptionPending(), at /builds/slave/m-aurora-osx-dbg/build/js/src/jscntxtinlines.h:314 I'm not sure what that means. On a recent 11 debug nightly, it doesn't seem to reproduce either.
Jesse, can you still reproduce this? Do you happen to have some similar test cases that do reproduce it? If not, I think I'll do a bisection on 11 to try to figure out where it stopped hitting the assertion, as I'm not really convinced this has been fixed magically.
Okay, now I can't reproduce this on a 12-08 nightly from here: ftp://ftp.mozilla.org/pub/firefox/nightly/2011/12/2011-12-08-mozilla-central-debug/ or on a couple of other December nightlies. I'm not sure what I'm doing wrong here.
Whiteboard: [sg:critical] → [sg:critical] [need answer to comment 12 from jruderman]
(In reply to Jesse Ruderman from comment #0) > Some variants crash instead, in exploitable-looking ways. Can we get one or two of those attached here? Might help mccr8 find the problem.
Okay, "fortunately" I can reproduce this issue on a local build of beta. I'll go from there.
Whiteboard: [sg:critical] [need answer to comment 12 from jruderman] → [sg:critical]
Thanks a lot to jst for helping me work through this. When we're morphing a slim wrapper into a regular wrapper, we create a new XPCWrappedNative, and set it to point at the native. We do not addref to do this, because the plan is to make the JS Object wrapper point to the XPCWN, so the ref count will stay the same. But we're in an inconsistent state until we make the update. We then attempt to enter the compartment of the existing JS object (the wrapper). This fails due to stack overflow. We then drop the last reference to the XPCWrappedNative, causing it to call its destructor. When the destructor runs, it sees that the XPCWN points to a native, so it calls release on this object, because it thinks it owns it, but it doesn't actually. This causes the ref count to be too low, causing the cycle collector assertion. The fix is to null out the pointer to the native in the XPCWN before we release it.
It looks like this was broken by brain transplants (bug 580128): http://hg.mozilla.org/mozilla-central/diff/9dc0c11e8eb8/js/src/xpconnect/src/xpcwrappednative.cpp That added the compartment enter check.
Attachment #604576 - Attachment description: fix an error path, patch against Fx11 → fix an error path, patch against Fx11 and 12
Attached patch fix for 13 — — Splinter Review
Basically the same.
Attachment #604590 - Flags: review?(bobbyholley+bmo)
\o/ Think we'll be able to fix this on the Firefox 10 branch? The bug will be fairly easy to discover once the nearScriptStackLimit gadget becomes public. It might become public as the result of other bugs being fixed, or as the result of the planned fuzzer release in April.
Isn't 11 releasing within a week? Pretty cool fuzzer tool, though! jst suggested that having a enter compartment thing that could be triggered to fail might help find weird error paths like this.
This is certainly fx10 esr material, but I think the fx11 ship has sailed at this point. If we do end up doing a 11.0.1 or what not this one could maybe ride along, but I wouldn't re-spin for this at this late of a time. Reliably exploiting this seems significantly less than trivial, even if people figure out how to reliably trigger this particular error, you still need to predict when the object in question gets prematurely destroyed, then write to the memory that held this object, and then wait for the existing references to this object to be touched, either by the CC, or some other code.
OS: Mac OS X → All
Hardware: x86_64 → All
Comment on attachment 604590 [details] [diff] [review] fix for 13 I'd love a comment here, but I can understand if we want to keep the commit low-profile. And slim wrappers are going away in the near future, so this will probably stay in my memory until then. Sorry for the delay. r=bholley
Attachment #604590 - Flags: review?(bobbyholley+bmo) → review+
We should land this toward the middle-end of the Fx12 cycle, and then land everywhere.
You mean the current Firefox 14 cycle? Sounds good to me. For my own reference, that's the very end of March.
[Triage Comment] Removing the tracking on this, leaving status-firefox-esr10 affected, we'll track it for the right esr release once it's fixed.
Comment on attachment 604576 [details] [diff] [review] fix an error path, patch against Fx11 and 12 [Approval Request Comment] Regression caused by (bug #): 580128 User impact if declined: users of 12 and 13 may be exposed to vulnerability once this lands Testing completed (on m-c, etc.): Passed a try run. I can land this on m-c a few days before I land it elsewhere to make sure it doesn't go crazy, but I'd prefer to have some kind of conditional approval before I do that, if possible. Risk to taking this patch (and alternatives if risky): Low. This patch only alters the behavior of a very rarely taken path that will probably crash anyways. String changes made by this patch: none
Attachment #604576 - Flags: approval-mozilla-beta?
Attachment #604576 - Flags: approval-mozilla-aurora?
Comment on attachment 604576 [details] [diff] [review] fix an error path, patch against Fx11 and 12 [Triage Comment] The process for landing sg:crit bugs hasn't changed since the FF11 security post-mortem, although other landing strategies were discussed. Unless this bug has special circumstances, let's land on m-c and then nominate once it's had a couple of days to bake for stability's sake.
Attachment #604576 - Flags: approval-mozilla-beta?
Attachment #604576 - Flags: approval-mozilla-aurora?
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Comment on attachment 604590 [details] [diff] [review] fix for 13 [Approval Request Comment] Regression caused by (bug #): bug 580128 User impact if declined: security vulnerability. Apparently this may become easy to find with fuzzer tool release in April (see comment 19). Testing completed (on m-c, etc.): it has been on m-c for a few days Risk to taking this patch (and alternatives if risky): should be low. This only alters the behavior of a rarely taken control path that is currently broken to the point where it will probably always crash it we hit it. String changes made by this patch: none
Attachment #604590 - Flags: approval-mozilla-esr10?
Attachment #604590 - Flags: approval-mozilla-beta?
Attachment #604590 - Flags: approval-mozilla-aurora?
Comment on attachment 604590 [details] [diff] [review] fix for 13 [Triage Comment] Low risk sg:crit fix - approved for all branches.
Attachment #604590 - Flags: approval-mozilla-esr10?
Attachment #604590 - Flags: approval-mozilla-esr10+
Attachment #604590 - Flags: approval-mozilla-beta?
Attachment #604590 - Flags: approval-mozilla-beta+
Attachment #604590 - Flags: approval-mozilla-aurora?
Attachment #604590 - Flags: approval-mozilla-aurora+
Whiteboard: [sg:critical] → [sg:critical][qa+]
Group: core-security
(In reply to Jesse Ruderman from comment #0) > Created attachment 580173 [details] > testcase > > 1. Load the testcase > 2. Click the button > 3. Close the testcase > 4. Force a CC (e.g. by quitting Firefox or using about:memory) > > Result: > > ###!!! ASSERTION: Fault in cycle collector: traversed refs exceed refcount > (ptr: 127319dc0) > : 'Not Reached', file > /Users/jruderman/trees/mozilla-central/xpcom/base/nsCycleCollector.cpp, line > 1235 Crashes after step 2 on ftp://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012-05-18-mozilla-beta-debug on Win 7, Ubuntu 12.04 and Mac OS X 10.6: Assertion failure: !cx->isExceptionPending(), at e:\builds\moz2_slave\m-beta-w32 -dbg\build\js\src\jscntxtinlines.h:317 https://crash-stats.mozilla.com/report/pending/6a95f8f1-1b4c-483b-b125-999632120521
Paul, can you file a new bug report for that? Do you know if it's a regression that happened after this bug's patch landed?
Sure Jesse. Bug 757004 submitted. I'll look for a regression range and post there the results
Based on comment 34 marking this as verified. No "ASSERTION: Fault in cycle collector: traversed refs exceed refcount" occurs in FF 13, 14 and ESR 10.0.5.
Status: RESOLVED → VERIFIED
Whiteboard: [sg:critical][qa+] → [sg:critical][qa!]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: