Closed Bug 672436 Opened 13 years ago Closed 13 years ago

Assertion failure: priv->stackElems + stackDepth == elem

Categories

(Core :: JavaScript Engine, defect)

7 Branch
x86
All
defect
Not set
critical

Tracking

()

RESOLVED FIXED
Tracking Status
firefox7 --- fixed
firefox8 --- fixed
status1.9.2 --- unaffected

People

(Reporter: bc, Assigned: luke)

References

()

Details

(Keywords: assertion, regression, reproducible, Whiteboard: [sg:critical?] js-triage-done [qa-])

Crash Data

Attachments

(6 files)

Attached file nightly debug stack
1. http://www.android-destek.com/?p=429]...........xrecovery-root
2. Assertion failure: priv->stackElems + stackDepth == elem, at /work/mozilla/builds/nightly/mozilla/js/src/jsexn.cpp:363

reproduced on Mac, Win Nightly and Aurora. Opt Aurora build didn't crash fwiw.

dmandelin, gal: lots of tracing on the stack.
Whiteboard: js-triage-needed
I have tried to reduce the test case, but I didn't manage to get it really small. I guess the problem depends on the size of the page. However I managed to remove a lot of the external references and scripts (apart from the Google translate stuff, it seems necessary to trigger this).

Although its still a big page, Google Chrome has no problem with it at all and loads the page almost instantly (but I guess the assertion already tells us it is a Firefox problem...)

The hang is unfortunately not 100% reproducible on the testcase, but when you ctrl-F5 several times (2 to 3 times is usually enough for me here, sometimes up to 5 times) it really hangs, the other times, it is just slow.

This testcase still gives the assertion when it gives a real hang:

Assertion failure: priv->stackElems + stackDepth == elem, at /home/bas/mozilla/src/js/src/jsexn.cpp:363

When it is just slow, I don't get the assertion...

I was able to remove about 70% of the size of the testcase. I hope this helps at least a bit with debugging.
bp-a947dcaa-2018-46c8-b443-b71ff2110723 winxp nightly/8
[@ WrappedNativeProtoMarker ]

bp-6acc476e-b0d5-4d25-b1f8-e1a352110723 winxp aurora/7
[@ JSCompartment::purge(JSContext*) ] 

note the socorro signatures associated with this assertion are js::gc::PushMarkStack which is filed as Bug 654903
Crash Signature: [@ WrappedNativeProtoMarker ] [@ JSCompartment::purge(JSContext*) ]
Thanks so much for this report! Bob, could you elaborate on comment 2? Does that mean that when you visited this page in an opt build, you crashed with those three signatures? I happened to crash in js::PropertyTable::search.

Based on the assertion, here's what I think is happening. When we fill out the exception object, we stop prematurely if we hit a stack frame from another compartment. Unfortunately, that leaves the rest of the exception object undefined (we do malloc instead of calloc in this case). Then if we trace through that exception object, we will try to "mark" through invalid pointers, which I'm guessing just causes us to corrupt arbitrary memory.

I'll take a closer look on Monday. Marking security sensitive due to the possible memory corruption.
Group: core-security
Bill, re comment 2 those two socorro reports/signatures were from current opt (nightly) builds for Aurora/Firefox 7 and Nightly/Firefox 8 that I downloaded just prior to testing.
I was wrong in comment 3. We're not hitting the assert because we break out of the loop when the compartment is different. I didn't notice that the previous loop also breaks out under the same condition, and it's the one that determines stackDepth. So that seems okay.

What's going on here is a little weirder. Basically, there are two loops that look like this:
    FrameRegsIter firstPass(cx);
    for (; !firstPass.done(); ++firstPass) { ...break at some point... }
    for (FrameRegsIter iter(cx); iter != firstPass; ++iter) { ... }
The assumption is that they will iterate over the exact same set of stack frames. The reason we assert is that they don't. The second loop sometimes includes an extra stack frame that the first one didn't see. This can lead to memory corruption and other bad stuff.

I instrumented the code to print the stack frames visited. In the bad case, it looks like this:
  first loop: B,C,D
  second loop: A,B,C,D
Then I added some code to print out when the top stack frame changes from B to A. It happens in the first iteration of the first loop, when we call checkAccess. This is some sort of security callback (nsScriptSecurityManager::CheckObjectAccess).

Some questions:
1. Is it reasonable for this CheckObjectAccess thing to add an extra stack frame to the current cx?
2. We call InitExnPrivate a lot on this web page. I noticed that in every single case except the bad one, we see the stack frames A,B,C,D (i.e., they're exactly the same addresses every time). Why is it that in this one case, we don't see A? Does this just mean that the stacks are usually of the same height when InitExnPrivate is called?

Maybe I'm being paranoid. It would be easy to "fix" this problem by saving the stack frames we see in a vector so we don't have to loop over them twice, or something like that. However, I want to make sure we're not papering over a deeper problem. It seems really weird to me that CheckObjectAccess is adding a frame. Could this be some kind of threading issue, where two threads are touching cx at the same time?

Luke, could you take a look at this? There's a lot I don't understand here (stack iteration, security checks).
Whiteboard: js-triage-needed → js-triage-done
Security pushes dummy frames (or maybe even real JS frames if somehow JS gets run), but this should definitely be done stack-ily such that the stack is the same before and after.  The underlying bug is probably an un-matched pop (e.g., JS_EnterCrossCompartmentCall without JS_LeaveCrossCompartmentCall).
Oh yeah, forgot to say, I'll take this; thanks for the triage.
Attached patch fix and testSplinter Review
Man, reproducible test cases are great.

The bug is that there are a few paths into InitExnPrivate which haven't called LeaveTrace.  Then the checkAccess hook causes LeaveTrace to be called so the stack changes size.  I think the cleanest fix is just to put LeaveTrace in StackIter.
Assignee: general → luke
Status: NEW → ASSIGNED
Attachment #548272 - Flags: review?(dvander)
Comment on attachment 548272 [details] [diff] [review]
fix and test

Review of attachment 548272 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #548272 - Flags: review?(dvander) → review+
Comment on attachment 548272 [details] [diff] [review]
fix and test

This should go on aurora.
Attachment #548272 - Flags: approval-mozilla-aurora?
http://hg.mozilla.org/mozilla-central/rev/714d30924fc5
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Comment on attachment 548272 [details] [diff] [review]
fix and test

We're going to let this one bubble up from central. Will this be a regression we'll ship in 7? How would this show up? If you can give us more context that would change our decision please nominate.
Attachment #548272 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora-
Comment on attachment 548272 [details] [diff] [review]
fix and test

The patch may be big but, as for the bits that ship, its a one line fix and the risk is low.  The bug would manifest itself as random crashes due to apparent JS corruption.  We know this hits (the test case is from the real web).  I think we should definitely take this.
Attachment #548272 - Flags: approval-mozilla-aurora- → approval-mozilla-aurora?
This is hitting, can we get this approved for aurora?
Luke, I wanted to test the fix on aurora.

Unfortunately, the changes to file js.cpp don't apply cleanly to the aurora branch. Will you attach a merged patch?
Changing version to 7, because I get this assertion on the Aurora7 branch.
Version: Trunk → 7 Branch
This is my attempt to merge the patch to aurora branch.

I confirm this patch fixes the aurora crash that I reported in bug 675966.
(In reply to comment #17)
> Changing version to 7, because I get this assertion on the Aurora7 branch.

Kai, I'm confused by the version change. Yes, this exists on 7 which is reflected in the status-firefox7 flag but the bug is marked fixed since it was fixed on trunk/8 not on aurora/7. Wouldn't it be more appropriate for the version to remain on trunk with status-firefox8 set to fixed and status-firefox7 set to affected?
Hi Bob, I don't want to confuse things. I assumed the version field is set to the earliest version where the bug can be seen. If my understanding is wrong, please revert to trunk. Thanks.
Comment on attachment 548272 [details] [diff] [review]
fix and test

Approved for releases/mozilla-aurora
Attachment #548272 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Does this apply to 1.9.2 as well? If so we'll want this for 1.9.2.21
Attached file beta stack (mac os x)
I can still see this assertion with a somewhat different but related stack on Beta/7 Win/Mac/Linux. I have three urls (one with personally identifiable information and two nsfw) that reproduce the assertion but that don't crash the nightly Beta. I'll try to get a reduced test case.
Bob: that assertion is different than the one in this bug.  (It's also less bad; in the past the fix has just been tweaks to satisfy the assert.)  Nonetheless, we do want to get them fixed; do the asserts reproduce on nightly?
Luke, this appears to be Beta only. New bug? Security Sensitive?
You've found several testcases that have triggered this particular assert and, IIRC, for some of them (the ones where there was no real problem in release builds) we decided to not land on branches and wait for the fixes to bubble up naturally.  So I would say no new bug and not security sensitive.
Any chance of a 3.6 patch for this? Is it affected? We don't want to announce this fix in 7 only to zero-day Firefox 3.6 users. Thanks!
3.6 is not affected.
I have never seen this specific assertion on 1.9.2 nor has 1.9.2 crashed on the url.
Yay, thanks! Marking as such.
qa- as no QA fix verification is needed
Whiteboard: js-triage-done → js-triage-done [qa-]
Keywords: regression
Whiteboard: js-triage-done [qa-] → [sg:critical?] js-triage-done [qa-]
Group: core-security
You need to log in before you can comment on or make changes to this bug.