Last Comment Bug 708825 - "ASSERTION: Fault in cycle collector: traversed refs exceed refcount" closing window nearScriptStackLimit
: "ASSERTION: Fault in cycle collector: traversed refs exceed refcount" closing...
Status: VERIFIED FIXED
[sg:critical][qa!]
: assertion, testcase
Product: Core
Classification: Components
Component: XPCOM (show other bugs)
: Trunk
: All All
: -- critical (vote)
: mozilla12
Assigned To: Andrew McCreight [:mccr8]
:
Mentors:
Depends on:
Blocks: 594645
  Show dependency treegraph
 
Reported: 2011-12-08 13:36 PST by Jesse Ruderman
Modified: 2012-05-22 02:47 PDT (History)
11 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
-
wontfix
-
wontfix
-
wontfix
-
wontfix
+
verified
+
verified
+
verified
12+
verified


Attachments
testcase (624 bytes, text/html)
2011-12-08 13:36 PST, Jesse Ruderman
no flags Details
fix an error path, patch against Fx11 and 12 (1.25 KB, patch)
2012-03-09 17:47 PST, Andrew McCreight [:mccr8]
no flags Details | Diff | Splinter Review
fix for 13 (1.32 KB, patch)
2012-03-09 18:43 PST, Andrew McCreight [:mccr8]
bobbyholley: review+
akeybl: approval‑mozilla‑aurora+
akeybl: approval‑mozilla‑beta+
akeybl: approval‑mozilla‑esr10+
Details | Diff | Splinter Review

Description Jesse Ruderman 2011-12-08 13:36:43 PST
Created attachment 580173 [details]
testcase

1. Load the testcase
2. Click the button
3. Close the testcase
4. Force a CC (e.g. by quitting Firefox or using about:memory)

Result:

###!!! ASSERTION: Fault in cycle collector: traversed refs exceed refcount (ptr: 127319dc0)
: 'Not Reached', file /Users/jruderman/trees/mozilla-central/xpcom/base/nsCycleCollector.cpp, line 1235

In addition, everything leaks.

I can reproduce with both Tinderbox debug builds and a local debug build (Mac OS X 10.6, 64-bit Firefox).

Some variants crash instead, in exploitable-looking ways.
Comment 1 Andrew McCreight [:mccr8] 2011-12-08 14:10:32 PST
When the cycle collector hits a fault like this, it turns itself off, which would explain the leaks.
Comment 2 Andrew McCreight [:mccr8] 2011-12-08 15:16:15 PST
The easiest way for me to reproduce this was to load the test case, click the button, then just exit the browser.  The first shutdown CC will hit the assertion, I believe.

The node in question is an <html> element.  It has a refcount of 3, with 4 references to it.  Those references are its 3 children, plus the document it is in.  The children are a <head>, <body> and a nsGenericDOMDataNode.  The nsGenericDOMDataNode seems like it has the most possibility for being what is wrong, but I'm not sure.

The DOM isn't traversed until it isn't part of a currently displayed window, which is probably why you have to close the testcase before the assertion triggers.
Comment 3 Jesse Ruderman 2011-12-08 16:41:23 PST
I'm curious which <html> element. The testcase, about:blank, or data:text/html,1 ?
Comment 4 Andrew McCreight [:mccr8] 2011-12-08 16:44:48 PST
The test case.  It looks like there's no data:test/html document by the time the CC fires off, but I could be wrong.
Comment 5 Jesse Ruderman 2011-12-08 16:53:53 PST
Oh, that explains why I had to leave in the line

  void ((new XMLSerializer).serializeToString(document.documentElement));
Comment 6 Daniel Veditz [:dveditz] 2011-12-15 13:07:49 PST
Is this a recent regression or has it been around for a while (e.g. does it affect Fx9 or earlier)?
Comment 7 Andrew McCreight [:mccr8] 2012-01-31 16:09:08 PST
I'll start looking at this again.  Though I worry the recent CC changes will make the test case succeed, but the problem still remains.  I can always go back to a slightly older build.

I'm curious about what is causing it, but I wonder if in general we should just quit the browser if we hit this fault.  It seems to me that if you have more references to something than the refcount thinks there are, that's just a dangling pointer waiting to happen.  This is the first time I've ever heard of this happening, so hopefully it isn't a common problem.  And by turning them into crashes, at least we'd have an idea if it was.
Comment 8 Andrew McCreight [:mccr8] 2012-01-31 16:21:41 PST
John glanced at the serializer code, and he said it looked like it was manually patching up some ref count related stuff, and had some scary comments, so that sounds promising.
Comment 9 Johnny Stenback (:jst, jst@mozilla.com) 2012-03-01 13:22:17 PST
Andrew, ping?
Comment 10 Andrew McCreight [:mccr8] 2012-03-01 15:04:05 PST
Oh right this.  I'll look at it after bug 723971.
Comment 11 Andrew McCreight [:mccr8] 2012-03-06 13:08:22 PST
I can't reproduce this on a recent Nightly debug build.  I even tried making all CCs WANT_ALL_TRACES which should disable most of the recent CC optimizations, but that didn't seem to matter.

On a debug Aurora build, it crashes immediately:
Assertion failure: !cx->isExceptionPending(), at /builds/slave/m-aurora-osx-dbg/build/js/src/jscntxtinlines.h:314

I'm not sure what that means.

On a recent 11 debug nightly, it doesn't seem to reproduce either.
Comment 12 Andrew McCreight [:mccr8] 2012-03-06 13:18:35 PST
Jesse, can you still reproduce this?  Do you happen to have some similar test cases that do reproduce it?

If not, I think I'll do a bisection on 11 to try to figure out where it stopped hitting the assertion, as I'm not really convinced this has been fixed magically.
Comment 13 Andrew McCreight [:mccr8] 2012-03-06 13:48:38 PST
Okay, now I can't reproduce this on a 12-08 nightly from here:
  ftp://ftp.mozilla.org/pub/firefox/nightly/2011/12/2011-12-08-mozilla-central-debug/

or on a couple of other December nightlies.  I'm not sure what I'm doing wrong here.
Comment 14 Daniel Veditz [:dveditz] 2012-03-06 17:42:49 PST
(In reply to Jesse Ruderman from comment #0)
> Some variants crash instead, in exploitable-looking ways.

Can we get one or two of those attached here? Might help mccr8 find the problem.
Comment 15 Andrew McCreight [:mccr8] 2012-03-07 11:22:11 PST
Okay, "fortunately" I can reproduce this issue on a local build of beta. I'll go from there.
Comment 16 Andrew McCreight [:mccr8] 2012-03-09 17:47:56 PST
Created attachment 604576 [details] [diff] [review]
fix an error path, patch against Fx11 and 12

Thanks a lot to jst for helping me work through this.

When we're morphing a slim wrapper into a regular wrapper, we create a new XPCWrappedNative, and set it to point at the native. We do not addref to do this, because the plan is to make the JS Object wrapper point to the XPCWN, so the ref count will stay the same. But we're in an inconsistent state until we make the update.

We then attempt to enter the compartment of the existing JS object (the wrapper). This fails due to stack overflow.  We then drop the last reference to the XPCWrappedNative, causing it to call its destructor.  When the destructor runs, it sees that the XPCWN points to a native, so it calls release on this object, because it thinks it owns it, but it doesn't actually.  This causes the ref count to be too low, causing the cycle collector assertion.

The fix is to null out the pointer to the native in the XPCWN before we release it.
Comment 17 Andrew McCreight [:mccr8] 2012-03-09 18:01:33 PST
It looks like this was broken by brain transplants (bug 580128):
  http://hg.mozilla.org/mozilla-central/diff/9dc0c11e8eb8/js/src/xpconnect/src/xpcwrappednative.cpp

That added the compartment enter check.
Comment 18 Andrew McCreight [:mccr8] 2012-03-09 18:43:23 PST
Created attachment 604590 [details] [diff] [review]
fix for 13

Basically the same.
Comment 19 Jesse Ruderman 2012-03-09 22:13:34 PST
\o/

Think we'll be able to fix this on the Firefox 10 branch?  The bug will be fairly easy to discover once the nearScriptStackLimit gadget becomes public. It might become public as the result of other bugs being fixed, or as the result of the planned fuzzer release in April.
Comment 20 Andrew McCreight [:mccr8] 2012-03-09 22:18:35 PST
Isn't 11 releasing within a week?

Pretty cool fuzzer tool, though!  jst suggested that having a enter compartment thing that could be triggered to fail might help find weird error paths like this.
Comment 21 Johnny Stenback (:jst, jst@mozilla.com) 2012-03-10 00:25:33 PST
This is certainly fx10 esr material, but I think the fx11 ship has sailed at this point. If we do end up doing a 11.0.1 or what not this one could maybe ride along, but I wouldn't re-spin for this at this late of a time.

Reliably exploiting this seems significantly less than trivial, even if people figure out how to reliably trigger this particular error, you still need to predict when the object in question gets prematurely destroyed, then write to the memory that held this object, and then wait for the existing references to this object to be touched, either by the CC, or some other code.
Comment 22 Bobby Holley (:bholley) (busy with Stylo) 2012-03-13 19:34:42 PDT
Comment on attachment 604590 [details] [diff] [review]
fix for 13

I'd love a comment here, but I can understand if we want to keep the commit low-profile. And slim wrappers are going away in the near future, so this will probably stay in my memory until then.

Sorry for the delay. r=bholley
Comment 23 Daniel Veditz [:dveditz] 2012-03-15 13:23:30 PDT
We should land this toward the middle-end of the Fx12 cycle, and then land everywhere.
Comment 24 Andrew McCreight [:mccr8] 2012-03-15 13:32:53 PDT
You mean the current Firefox 14 cycle?  Sounds good to me.

For my own reference, that's the very end of March.
Comment 25 Lukas Blakk [:lsblakk] use ?needinfo 2012-03-20 10:05:53 PDT
[Triage Comment]
Removing the tracking on this, leaving status-firefox-esr10 affected, we'll track it for the right esr release once it's fixed.
Comment 26 Andrew McCreight [:mccr8] 2012-03-28 15:32:21 PDT
Comment on attachment 604576 [details] [diff] [review]
fix an error path, patch against Fx11 and 12

[Approval Request Comment]
Regression caused by (bug #): 580128
User impact if declined: users of 12 and 13 may be exposed to vulnerability once this lands
Testing completed (on m-c, etc.): Passed a try run.  I can land this on m-c a few days before I land it elsewhere to make sure it doesn't go crazy, but I'd prefer to have some kind of conditional approval before I do that, if possible.
Risk to taking this patch (and alternatives if risky): Low. This patch only alters the behavior of a very rarely taken path that will probably crash anyways. 
String changes made by this patch: none
Comment 27 Alex Keybl [:akeybl] 2012-03-29 15:35:43 PDT
Comment on attachment 604576 [details] [diff] [review]
fix an error path, patch against Fx11 and 12

[Triage Comment]
The process for landing sg:crit bugs hasn't changed since the FF11 security post-mortem, although other landing strategies were discussed. Unless this bug has special circumstances, let's land on m-c and then nominate once it's had a couple of days to bake for stability's sake.
Comment 28 Andrew McCreight [:mccr8] 2012-03-30 11:11:24 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/428dbcf861b5
Comment 29 Andrew McCreight [:mccr8] 2012-03-31 15:02:27 PDT
https://hg.mozilla.org/mozilla-central/rev/428dbcf861b5
Comment 30 Andrew McCreight [:mccr8] 2012-04-03 11:47:32 PDT
Comment on attachment 604590 [details] [diff] [review]
fix for 13

[Approval Request Comment]
Regression caused by (bug #): bug 580128
User impact if declined: security vulnerability.  Apparently this may become easy to find with fuzzer tool release in April (see comment 19).
Testing completed (on m-c, etc.): it has been on m-c for a few days
Risk to taking this patch (and alternatives if risky): should be low.  This only alters the behavior of a rarely taken control path that is currently broken to the point where it will probably always crash it we hit it.
String changes made by this patch: none
Comment 31 Alex Keybl [:akeybl] 2012-04-03 11:57:04 PDT
Comment on attachment 604590 [details] [diff] [review]
fix for 13

[Triage Comment]
Low risk sg:crit fix - approved for all branches.
Comment 33 Paul Silaghi, QA [:pauly] 2012-05-21 02:41:21 PDT
(In reply to Jesse Ruderman from comment #0)
> Created attachment 580173 [details]
> testcase
> 
> 1. Load the testcase
> 2. Click the button
> 3. Close the testcase
> 4. Force a CC (e.g. by quitting Firefox or using about:memory)
> 
> Result:
> 
> ###!!! ASSERTION: Fault in cycle collector: traversed refs exceed refcount
> (ptr: 127319dc0)
> : 'Not Reached', file
> /Users/jruderman/trees/mozilla-central/xpcom/base/nsCycleCollector.cpp, line
> 1235

Crashes after step 2 on ftp://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012-05-18-mozilla-beta-debug on Win 7, Ubuntu 12.04 and Mac OS X 10.6:

Assertion failure: !cx->isExceptionPending(), at e:\builds\moz2_slave\m-beta-w32
-dbg\build\js\src\jscntxtinlines.h:317

https://crash-stats.mozilla.com/report/pending/6a95f8f1-1b4c-483b-b125-999632120521
Comment 34 Jesse Ruderman 2012-05-21 02:57:33 PDT
Paul, can you file a new bug report for that? Do you know if it's a regression that happened after this bug's patch landed?
Comment 35 Paul Silaghi, QA [:pauly] 2012-05-21 04:22:47 PDT
Sure Jesse. Bug 757004 submitted. I'll look for a regression range and post there the results
Comment 36 Paul Silaghi, QA [:pauly] 2012-05-22 02:19:25 PDT
Based on comment 34 marking this as verified. No "ASSERTION: Fault in cycle collector: traversed refs exceed refcount" occurs in FF 13, 14 and ESR 10.0.5.
Comment 37 Paul Silaghi, QA [:pauly] 2012-05-22 02:47:08 PDT
Executing the steps in comment 0 and closing Firefox after --> bug 404828
ftp://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012-05-18-mozilla-central-debug

Note You need to log in before you can comment on or make changes to this bug.