Correctness regression on the r-tree benchmark

RESOLVED FIXED in mozilla1.9.3a1

Status

()

defect
RESOLVED FIXED
10 years ago
7 years ago

People

(Reporter: bzbarsky, Assigned: jorendorff)

Tracking

({regression})

Trunk
mozilla1.9.3a1
Points:
---
Dependency tree / graph
Bug Flags:
blocking1.9.2 +

Firefox Tracking Flags

(status1.9.2 final-fixed, status1.9.1 unaffected)

Details

(Whiteboard: [sg:critical?] fixed-in-tracemonkey, )

Attachments

(3 attachments, 1 obsolete attachment)

The benchmark in the url bar throws an exception in 3.6rc1 instead of completing; works fine in 3.5, according to comments at http://hacks.mozilla.org/2010/01/javascript-speedups-in-firefox-3-6/

I can certainly reproduce on m-c.  Trying to find some regression ranges now.
Flags: blocking1.9.2?
So this used to break on "Performing 1k window queries.." but then started working there and now breaks on the "Deleting..." part.  That change was in this range:

http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=fd765c193770&tochange=21972540dbf4

Tracemonkey merge in there, as expected.  I'll see if I can pin this down more.

Still not sure when the first (bigger) breakage appeared.
Original regression is in this range:

http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=38754465ffde&tochange=3079370d6597

With again a tm merge, huge one.
My initial guess from that big merge is one of the closure fixes... trying to bisect now.

Updated

10 years ago
Flags: blocking1.9.2? → blocking1.9.2+
OK, bisecting for that first thing totally failed, because 64-bit jit was enabled _after_ the regression was introduced.  So the bisect found that changeset.

I'll try to find where the partial fix happened...
I have just confirmed that I do see this problem with Fx 3.6rc1 on Mac, by the way.
OK, the partial fix was due to the fix for bug 525028 looks like.
OK, bisect on t-m on my mac shows that the original regression was due to the landing of http://hg.mozilla.org/tracemonkey/rev/2db233807725

That's pretty darned surprising, but I double-checked it via local backout (updated to that changeset, then took out the #include, and the problem went away).

Can anyone explain that?  :(
And of course starting with the parent of that changeset and applying that changeset's patch _also_ works.  Sounds like there are dependency problems somewhere....  :(

On the other hand, then applying the patch for bug 505591 makes things fail.  Not sure whether that exposes an existing problem by making us trace more or introduces a problem.
Blocks: 505591
Keywords: regression
Can I get blocking rationale? Not saying that I disagree, per se, but I'm not sure I understand the actual impact to users of this problem, nor the scope or potential for a fix, and marking it as a 1.9.2 blocker effectively holds the release date for Firefox 3.6 at this time, so I'll need to understand and communicate this out to all affected parties.
It's a JavaScript-runs-correctly issue on at least one benchmark; we don't know how widely the triggering pattern is used on the web, but history indicates that we almost never *over*estimate the frequency of such things, unfortunately.

Does bisecting on 1.9.2 work more cleanly?  Certainly less changeset traffic, but maybe not more helpfully organized.
Assignee

Comment 11

10 years ago
Posted file Slightly reduced test case. (obsolete) —
> Does bisecting on 1.9.2 work more cleanly?

It'll take me another 3-4 hours of building to test, unfortunately.  I'll try to do that sometime today if no one else gets to it first.
OK.  So I put a breakpoint in the relevant JS_SetPendingException, and I get:

0 anonymous(b = undefined, a = [object Object]) ["http://stackulator.com/rtree/rtree.js":689]
1 anonymous(root = undefined, obj = undefined, rect = [object Object]) ["http://stackulator.com/rtree/rtree.js":103]
    t = undefined
    ltree = undefined
    i = undefined
    tree = undefined
    ret_obj = undefined

from DumpJSStack().  |ltree| is what's passed for |b|, so that makes sense, sort of.  But |ltree| is set by doing:

  var ltree = tree.nodes[i];

right before making that call to the function that takes a and b.  And if |tree| is undefined, why did that not throw?  And i should definitely not be undefined...

It feels like something has scribbled on memory somewhere.  Or something.
Assignee

Comment 14

10 years ago
The previous test case actually found an old bug, not the bug we're looking for now. Bogus test reduction.

This one throws a TypeError with -j in 1.9.2 tip shell and passes without -j (though it takes a few seconds to run).
Attachment #421642 - Attachment is obsolete: true
Assignee

Updated

10 years ago
Assignee: general → jorendorff
Assignee

Comment 16

10 years ago
TraceRecorder::record_JSOP_LENGTH()
{
    ...
    if (STOBJ_GET_CLASS(obj) == &js_ArgumentsClass) {
        ...
        LIns* v_ins = lir->ins1(LIR_i2f, INS_CONST(afp->argc));

But fp->argc does not change when a script assigns to arguments.length.

Note that by the time we get on trace, it may already be out of sync, so we can't fix this by detecting at record time if this has happened and aborting.

Note that this is not just a correctness bug. In bz's test case, argc is an int and arguments.length is an int. If instead arguments.length is a different type:

function f() {
    var x = arguments;
    arguments.length = {};
    for (var i = 0; i < 9; i++)
        print(x.length);               // segfault near 0
}
f();

TraceRecorder puts a small integer, argc, on the stack, but the interpreter puts an object on the stack. Hilarity ensues.
Group: core-security
Assignee

Comment 17

10 years ago
Filed bug 539766 for somewhat-related shenanigans involving arguments and Object.defineProperty.
Assignee

Comment 18

10 years ago
Posted patch v1Splinter Review
Attachment #421700 - Flags: review?(dmandelin)
(In reply to comment #18)
> Created an attachment (id=421700) [details]
> v1

I have questions about one feature: with this patch, for JSOP_ARGCNT, we record LIR that guards that the length has not changed and then returns the constant fp->argc, while for JSOP_LENGTH on |argsobj|, we record LIR that guards that the length has not changed and then returns the length out of the argsobj. Should they be different? Shouldn't they just both return the length out of the argsobj? And if so, do we need the guard, or can we just read then length out and return that?

It looks correct, though. The comments above just relate to how much we can stay on trace.
(In reply to comment #16)
> Note that this is not just a correctness bug. In bz's test case, argc is an int
> and arguments.length is an int. If instead arguments.length is a different
> type:

Are we any closer to understanding the potential web compatibility effect here, or when the regression appeared in betas? Our crash data for Firefox 3.6rc1 is actually quite encouraging, and we haven't had reports of major websites not functioning properly, so I'm wondering if we may have a correctness issue that affects things which - so far - the vast majority of websites aren't hitting?
(In reply to comment #20)
> (In reply to comment #16)
> > Note that this is not just a correctness bug. In bz's test case, argc is an int
> > and arguments.length is an int. If instead arguments.length is a different
> > type:
> 
> Are we any closer to understanding the potential web compatibility effect here,
> or when the regression appeared in betas? Our crash data for Firefox 3.6rc1 is
> actually quite encouraging, and we haven't had reports of major websites not
> functioning properly, so I'm wondering if we may have a correctness issue that
> affects things which - so far - the vast majority of websites aren't hitting?

This regression must have been introduced by one of the patches for tracing using the |arguments| keyword, most likely the first.

The bug is that we can go wrong if a use of |arguments| is traced, the JS program modifies the |arguments| object in a way that changes its length, and then the program accesses the |arguments| length indirectly (i.e., args=arguments; args.length). That code makes me very sad and is probably not used by most pages, *but* IME chasing down various regressions, advanced library code (e.g. jQuery) uses all kinds of weird things.
Attachment #421700 - Flags: review?(dmandelin) → review+
Assignee

Comment 22

10 years ago
(In reply to comment #20)
> Are we any closer to understanding the potential web compatibility effect here,
> or when the regression appeared in betas? Our crash data for Firefox 3.6rc1 is
> actually quite encouraging, and we haven't had reports of major websites not
> functioning properly, so I'm wondering if we may have a correctness issue that
> affects things which - so far - the vast majority of websites aren't hitting?

Maybe, but we discovered this because a Web page triggered it in the wild. The particular idiom that triggers the bug is reasonable, even if it does make dmandelin sad. I think the potential web compatibility effect is significant.

It's also a crasher.
(In reply to comment #22)
> (In reply to comment #20)
> > Are we any closer to understanding the potential web compatibility effect
> > here, or when the regression appeared in betas? Our crash data for Firefox
> > 3.6rc1 is actually quite encouraging, and we haven't had reports of major
> > websites not functioning properly, so I'm wondering if we may have a
> > correctness issue that affects things which - so far - the vast majority of
> > websites aren't hitting?
> 
> Maybe, but we discovered this because a Web page triggered it in the wild.
> The particular idiom that triggers the bug is reasonable, even if it does
> make dmandelin sad. I think the potential web compatibility effect is
> significant.
> It's also a crasher.

I agree. I guess I couldn't resist editorializing (and I still don't think it's reasonable :-D ), but I tried to say at the end that I think there is a high risk that big important stuff like jQuery or FB could trigger this bug.
Assignee

Comment 24

10 years ago
(In reply to comment #19)
> (In reply to comment #18)
> I have questions about one feature: with this patch, for JSOP_ARGCNT, we record
> LIR that guards that the length has not changed and then returns the constant
> fp->argc, while for JSOP_LENGTH on |argsobj|, we record LIR that guards that
> the length has not changed and then returns the length out of the argsobj.
> Should they be different? Shouldn't they just both return the length out of the
> argsobj?

We discussed on IRC, but for the record:

In the case of JSOP_ARGCNT, we know that we're talking about the arguments object for the current frame. In JSOP_LENGTH we have only guarded enough to know that it's an arguments object.

Since ARGCNT *can* safely use argc here, I figured it should, since that might enable some constant folding or eliminate a data dependency.

> And if so, do we need the guard, or can we just read then length out
> and return that?

No: if the bit is set, arguments.length is now just a regular property, and the value in that slot is no longer relevant.
Assignee

Comment 25

10 years ago
http://hg.mozilla.org/tracemonkey/rev/aba69ed5c41d
Whiteboard: fixed-in-tracemonkey
For what it's worth, a number of the codepaths in the Prototype library mess with the arguments object; I haven't looked at in a few days, but I'm pretty sure some of those can trigger this bug.  Of course it might not always immediately break on all sites.  Depends on what the arguments were and what the site is doing.
Added another test case to make the release safer:

http://hg.mozilla.org/tracemonkey/rev/a39e0b557864
I tested the |arguments| microbenchmark in the hacks article with a fresh TM build that includes this patch. Performance was not affected.
(In reply to comment #21)
> This regression must have been introduced by one of the patches for tracing
> using the |arguments| keyword, most likely the first.

Do we know when those patches were merged over to mozilla-1.9.2? I'm perhaps thinking about this wrong, but we've had about 400,000+ ADUs since Beta 3, and are now at about 1M ADUs. So far, aside from the r-tree benchmark page, I'm not aware of us having linked this problem to a page in the wild.

Allow me to be clear: I agree that this is a bad bug, and one that needs to be fixed. What I'm not sure of is whether it is a bug that we should fix before shipping 3.6.0 as opposed to fixing (along with other bad bugs, like not tracing web workers) in 3.6.1 in four to six weeks.

> args=arguments; args.length). That code makes me very sad and is probably not
> used by most pages, *but* IME chasing down various regressions, advanced
> library code (e.g. jQuery) uses all kinds of weird things.

Do we know of additional pages that regress due to this bug other than the r-tree benchmark? Again, I'm not trying to be difficult, I'm trying to make sure I understand the actual effect on users.
bz helped clear some of the questions up in IRC:

 - the original problem has existed since 1.9.2a1
 - it was exacerbated first in beta 1 by bug 505591
 - it was further exacerbated in beta 3 by bug 525028
I don't see any arguments.length (or a.length for a possibly an arguments object) setting in prototype-1.6.0.2, FWIW.

/be
a1921=beltzner

Whoever sees this first, olease land this on mozilla-1.9.2 ASAP so we can get some testing cycles on it. If we decide to do a 2nd RC for this, we'll then land it on the relbranch.
http://hg.mozilla.org/mozilla-central/rev/6c9304d3aa15
http://hg.mozilla.org/mozilla-central/rev/269b08fc3508
Status: NEW → RESOLVED
Closed: 10 years ago
OS: Mac OS X → All
Hardware: x86 → All
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9.2
Target Milestone: mozilla1.9.2 → mozilla1.9.3a1

Comment 35

10 years ago
This appears to be burning the 3.6 tree down.

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1263541404.1263542720.10358.gz&fulltext=1

e:/builds/moz2_slave/mozilla-1.9.2-win32/build/js/src/jstracer.cpp(14027) : error C3861: 'RETURN_STOP_A': identifier not found
(In reply to comment #35)
> This appears to be burning the 3.6 tree down.
> 
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1263541404.1263542720.10358.gz&fulltext=1
> 
> e:/builds/moz2_slave/mozilla-1.9.2-win32/build/js/src/jstracer.cpp(14027) :
> error C3861: 'RETURN_STOP_A': identifier not found

I had to hand-patch 1.9.2 because of various function and type name differences between trunk and 1.9.2, but I guess I missed that macro difference. s/RETURN_STOP_A/ABORT_TRACE/ to fix.

http://hg.mozilla.org/releases/mozilla-1.9.2/rev/08a91a7b080a
Are we sure that fixed it? While I thank Reed for jumping on top of things, I actually assumed that the patch was already built to work with mozilla-1.9.2 and am now worried that this isn't the case.

Things are looking good, but I'd also like a confirmation from someone from the JS team or Shaver.

Comment 38

10 years ago
(In reply to comment #37)
> Are we sure that fixed it? While I thank Reed for jumping on top of things, I
> actually assumed that the patch was already built to work with mozilla-1.9.2
> and am now worried that this isn't the case.

Reed's fix is the right one, but never-compiled patches are never good.
Assignee

Comment 39

10 years ago
Yep, Reed's fix looks good.  Thanks, Reed.
Whiteboard: fixed-in-tracemonkey → [sg:critical?] fixed-in-tracemonkey

Updated

10 years ago
Depends on: 540404
Based on local backout this caused bug 540404 (which is a blocker at least to my
development work).
Flags: in-testsuite?
Who on this bug or with privs to view this bug was responsible for the description being posted here: http://ffextensionguru.wordpress.com/2010/01/17/firefox-3-6rc2-released/ ?
Ah, he got it out of the HG log, I suspect.
Group: core-security
Flags: in-testsuite?
You need to log in before you can comment on or make changes to this bug.