Closed Bug 543100 Opened 10 years ago Closed 9 years ago

Long "else if" chain causes "too much recursion" error

Categories

(Core :: JavaScript Engine, defect, major)

x86
macOS
defect
Not set
major

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- betaN+
blocking1.9.2 --- -
status1.9.2 --- wanted

People

(Reporter: florian, Assigned: luke)

References

()

Details

(Keywords: regression, testcase, Whiteboard: fixed-in-tracemonkey)

Attachments

(4 files, 1 obsolete file)

User-Agent:       Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6
Build Identifier: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6

Some pages containing javascript do not load the whole page and post a Too Much Recursion Error

Reproducible: Always

Steps to Reproduce:
1. GoTo https://cls.scalix.com/support.php 
Actual Results:  
Page loads partially; form fields are missing.

Error console has

Error: too much recursion

Expected Results:  
Page should load w/o such error, displaying forms. 

This worked fine in FF3.5

It also seems to work fine in FF3.6/Windows and Linux, so it is specific to FF3.6/Mac.
Can you try in a new profile please?
http://support.mozilla.com/en-US/kb/Managing+profiles
Version: unspecified → 3.6 Branch
(In reply to comment #1)
> Can you try in a new profile please?
> http://support.mozilla.com/en-US/kb/Managing+profiles

- started FF with ProfileManager
- created new profile ("test1")
- opted to ask on startup, restarted, selected new profile
- FF shows "Know Your Rights" bar as well as default FF3.6 homepage
- open link provided in bug

same issue and Error console message occurs.

This has been reproduced on another machine as well. In case it matters, this is Mac OS/X 10.6.2, all Software Updates installed, running on a MBP 15".
What needs to happen so that anyone looks at this? A number of users in our organization cannot upgrade to FF3.6 because of this regression and as it's happening on a number of pages associated with that system and the JS used is not overly complex, I find it hard to believe that this is the only website where it happens.
Summary: Page does not with "too much recursion error" → Regression: Page does not with "too much recursion error"
Summary: Regression: Page does not with "too much recursion error" → Regression: Page does not load with "too much recursion error"
Is there any way to get this confirmed and looked at?

We have, in the meantime, re-arranged the JScript of this page, even taken out a number of non-critical functions, etc., however CANNOT get it to work. 

I had to migrate back to Safari and FF3.5

This is a FULLY reproduceable REGRESSION of FF3.6/Mac, happening on multiple client systems, etc., and with a sample page that exhibits the behavior.
WFM with Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.3a3pre) Gecko/20100315 Minefield/3.7a3pre ID:20100315035441.
Please test with a nightly.
http://nightly.mozilla.org/
(In reply to comment #5)
> WFM with Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.3a3pre)

It's been clearly stated in Comment #0 that this is only reproducible on 3.6/Mac, not on Windows or Linux.

> Please test with a nightly.
> http://nightly.mozilla.org/

will do.
Very same error happens on

Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.3a4pre) Gecko/20100316 Minefield/3.7a4pre

page at

https://cls.scalix.com/support.php 

does not load and produces the "Too Much Recursion" entry in error console.
also happens on

Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2

Is there a chance that someone at least CONFIRMS this issue?
blocking2.0: --- → ?
It's hard to see what the purpose is of reporting issues with FF when noone even bothers to confirm an issue that is 100% reproducible across multiple versions of product with a publicly accessible webpage, is a clear regression over previous release and affects a relatively simple page as well.

Quick disconcerting, IMHO. Especially as it makes our customers unable to log support cases with us using firefox. I find it hard to believe that we as a Linux/OpenSource-related company must now recommend to a certain class of users to move away from Firefox or stick with an older version.
Florian, the majority of people on bugzilla are volunteers, and most don't even own macs. so, until someone with a mac looks at this, we can't really confirm it.
(In reply to comment #10)
> Florian, the majority of people on bugzilla are volunteers, and most don't even
> own macs. so, until someone with a mac looks at this, we can't really confirm
> it.

Tyler - that's much appreciated, including your own personal dedication and attention. However, I think for a project as visible as FF, there must be some structure or more dedicated people giving attention to platform-specific issues at some level - I mean, how would the community e.g. react if someone reported a possible security issue caused by a platform specific buffer overflow or similar; this would need some max. time to evaluate, at the very least, otherwise it would make the platform non-sustainable.

I have a real problem here; I do represent a software vendor, with a focus on "alternative", i.e. non-vendor environments; we've been supporting Firefox for our AJAX web clients inside our app since the 1.0 days, with significant effort spent, and been committed to this ever since. 

Now, one of the web pages affected by this issue (and made unuseable, to be clear), is our main commercial support incident report form, i.e. the entry point into the support process for our customers. Our web developmer has already spent couple hours refactoring the JS code on that page, in the hope to remove the condition that triggers the Firefox problem, to no avail. Unfortunately, we don't have any inside development capacity for Mozilla code in-house, otherwise we'd have tried to debug this ourselves.

A number of our customers have noticed that the page doesn't work for them; so far, I've been responding on a personal and 1:1 level, pointing to this bug of an early 3.6 release, assuming it's something the FF team would look at for a 3.6.x patch update; a couple of those have come, and no progress. To limit my exposure to my customers, I'm now faced with the decision to put up a banner on that page stating that due to a FF bug, this page cannot be used with FF 3.6 and later on Mac, and people should move to other browsers for that, such as Safari and Chrome, or use an older version of FF.

This type of Anti-Marketing for FF is the last I wanna do, but given this has now been on the record for almost three months, I'm left with little choice.

So ... what do you suggest?

If there is any kind of JS/Tracemonkey debug mode we can run this in to collect and provide for better logging information, let me know and we'll do. If you wanna review the page's JS and suggest refactoring that may remove the issue, same thing - we'll do whatever we can, but I must take some action, one way or the other.

Tx,
Florian.
I'm therefore faced with the decid
> (In reply to comment #10)
> ... and most don't even own macs.

Now... :-) not sure if I agree here looking at the visible percentage percentage of silvery laptops sporting the familiar logo at "usual suspect" conferences in open source space such as Oscon and similar. ;-)
Attached file A somewhat reduced test case (obsolete) —
Confirmed on Mac 10.6 Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.3a5pre) Gecko/20100412 Minefield/3.7a5pre.

It is causing a too much recursion error caused by recursive javascript parsing. You can see the error in the error console if you load this testcase.  I think this might be due to the massive number of closure functions in this function.  But I really don't know how to reduce it further.  I'll CC Jesse and Gary as they might know more about how to reduce this further to see if it is a known issue, a website issue or something new.

I verified that this occurs whether you attempt to load this piece of javascript via a script tag in the HTML (as in this testcase) or if you try to load it as a file, I had thought there might be a difference there when looking at the stack backtrace, but it turned out that the error is the same both ways.

Also, I think that bug 540861 is a dupe of this one -- the testcase is based off the same code.
Assignee: nobody → general
Status: UNCONFIRMED → NEW
Component: General → JavaScript Engine
Ever confirmed: true
Product: Firefox → Core
QA Contact: general → general
Version: 3.6 Branch → Trunk
Duplicate of this bug: 540861
Attached file simpler testcase
Attachment #439055 - Attachment is obsolete: true
The number of "else if" lines needed to trigger the bug seems to differ slightly between shell and browser.
Keywords: regression, testcase
Summary: Regression: Page does not load with "too much recursion error" → Long "else if" chain causes "too much recursion" error
Attached file shell testcase
(In reply to comment #16)
> The number of "else if" lines needed to trigger the bug seems to differ
> slightly between shell and browser.

That's expected. We monitor thread stack depth against a byte limit, not recursion depth against a recursing call count.

/be
(In reply to comment #18)
> That's expected. We monitor thread stack depth against a byte limit, not
> recursion depth against a recursing call count.

Is that a new feature in 3.6 (as resulting behavior is changed from 3.5!) and/or has the value of that limit changed for OS/X platform between these versions? Is it platform-specific at all? (I will assume that thread stack use as a byte count is a platform-dependent thing?)
Has not changed in ages. See

http://mxr.mozilla.org/mozilla-central/ident?i=JS_SetThreadStackLimit

Sensitive to compiler changes over time (size of frame for each function actived in a given flow).

Maybe something's busted elsewhere. If someone can gdb on Mac and catch the error as it is being turned into an exception, that would be helpful. IRC #jsapi people will help.

/be
autoBisect shows this is probably related to bug 494994:

The first bad revision is:
changeset:   28739:5b0eb5022f03
user:        Andreas Gal
date:        Tue May 26 18:53:42 2009 -0700
summary:     Build optimized JS shell with -O3 when using gcc (494994, r=sayrer).

Jesse's shell testcase seems to show a "PASS" with a shell from the changeset before this but a "too much recursion" error with a shell from this changeset.
blocking2.0: ? → beta1+
We should get this fixed on 1.9.2 as well, imo.  The platform difference is not good.

I can reproduce on mac opt, but obviously not mac debug.
blocking1.9.2: --- → ?
status1.9.2: --- → ?
Hrm.  So looking at the backtrace when I hit js_ReportOverRecursed, it looks like close to 1000 frames of JSCompiler::statement.

In the opt build:

(gdb) frame 100
#100 0x000b4b4f in JSCompiler::statement (this=0xbfffeba8) at /Users/bzbarsky/mozilla/tracemonkey/mozilla/js/src/jsparse.cpp:4566
4566                    if (tt == TOK_CASE) {
(gdb) info frame
Stack level 100, frame at 0xbff8fb00:
 called by frame at 0xbff8fd00, caller of frame at 0xbff8f900
(gdb) p 0xbff8fd00 - 0xbff8fb00 
$12 = 512


The corresponding debug build value is 288.

On 64 bit Linux, fwiw, I get 416 in a debug build and 288 in an opt build (which seems to just stick all sorts of stuff in registers instead of on the stack).  The debug build ends up with the recursion error, while the opt build does not.

It really seems like there's not much we can do here given all the compiler differences...
We're not trying to be precise (counting frames, e.g.) or cross-platform. We just don't want to take down a thread, or the whole process (or worse, violate memory safety -- possible with overlarge stack allocation via alloca, and I'm not sure the compiler's stack probes suffice to avoid this) due to overnested input, which could be a code generator mistake as much as an evil DOS.

So gcc -O3 uses less stack, not more. Great!

Can we adjust the test accordingly?

/be
Can the parser be changed so "else if" chains don't cause deep recursion?
(Midaired with Jesse, who needs to get out of my brain.)

Could we fold the if/else tree into a list, like we do with left-assoc binary operators?
(In reply to comment #26)
> (Midaired with Jesse, who needs to get out of my brain.)
> 
> Could we fold the if/else tree into a list, like we do with left-assoc binary
> operators?

The emitter does this:

http://mxr.mozilla.org/mozilla-central/source/js/src/jsemit.cpp#4551

but the parser cannot, since there is no pre-parser to recognize sentences in the language and build a tree, from which to fold things into a list:

From Parser::statement():

      case TOK_IF:
        /* An IF node has three kids: condition, then, and optional else. */
        pn = TernaryNode::create(tc);
        if (!pn)
            return NULL;
        pn1 = condition();
        if (!pn1)
            return NULL;
        js_PushStatement(tc, &stmtInfo, STMT_IF, -1);
        pn2 = statement();
        if (!pn2) 
            return NULL;
        if (tokenStream.matchToken(TOK_ELSE, TSF_OPERAND)) {
            stmtInfo.type = STMT_ELSE;
            pn3 = statement();
            if (!pn3)
                return NULL;
            pn->pn_pos.end = pn3->pn_pos.end;
        } else {
            pn3 = NULL;
            pn->pn_pos.end = pn2->pn_pos.end;
        }   
        PopStatement(tc); 
        pn->pn_kid1 = pn1;
        pn->pn_kid2 = pn2;
        pn->pn_kid3 = pn3;
        return pn;

Anyway, the problem is not too much recursion -- it's easy to make the parser recur and we would be chasing a lot of bugs if we didn't go deep enough to "parse the web" in general. We seem to handle the site mentioned in comment 0 on Windows and Linux.

So why does the Mac build switching to gcc -O3, which uses *less* stack, make this error bite only 3.6 Mac? According to bz's gdb'ing, we end up near 1000 statements deep while Linux doesn't reach 300 statements. Is 3.5 on the Mac simply throwing sooner and that somehow doesn't get noticed?

IOW, why is tolerating deeper recursion on Mac in 3.6 leading to a new error? You'd expect "too much recursion" error checking to be monotonic.

/be
> So why does the Mac build switching to gcc -O3, which uses *less* stack,

Where did that come from?  Mac opt uses more stack than mac debug; see comment 23.  I didn't compare -O2 to -O3.

> According to bz's gdb'ing, we end up near 1000 statements deep while Linux
> doesn't reach 300 statements.

I think you're misreading the numbers.  On Mac -O3, a single stack frame of Parser::statement() is 512 bytes.  On Mac debug, it's 288 bytes.  On Linux 64-bit opt (whatever level that is), it's 288 bytes.  On Linux 64-bit debug it's 416 bytes.
And to put more numbers to this, the testcase has 1335 else ifs.  1335 * 512 == 683520.  1335 * 416 == 555360.  1335 * 288 == 384480.  Our stack limit is 500000.  Which is why we get the exception on my linux debug build and mac opt build but not the other two.  So it all makes sense.
I must have misread comment 23 -- still not sure how, but never mind, and sorry.

Ok, so why is gcc -O3 bloating up js::Parser::statement's frame? There are not a lot of block local variables in this method. Is there any way to get gcc to say what is on its mind in doing this?

/be
Oh, I misread "... it looks like close to 1000 frames ... $12 = 512 ... The corresponding debug build value is 288" to mean by "debug build value" the "1000", not the equivalent to $12. Yes, big is bad.

512 bytes is a ton of space. The locals in js::Parser::statement are 

    TokenKind tt;
    JSParseNode *pn, *pn1, *pn2, *pn3, *pn4;
    JSStmtInfo stmtInfo, *stmt, *stmt2;
    JSAtom *label;

    JS_CHECK_RECURSION(context, return NULL);

where there's a dummy in the JS_CHECK_RECURSION macro, and stmtInfo is 32 bytes.

10 * 4 + 32 << 512

/be
case TOK_TRY: adds two pointers, catchList and lastCatch. TOK_FOR and TOK_SWITCH add a bit more, TOK_FOR in particular another JSStmtInfo record. But even if gcc were summing instead of unifying, I don't see how it reached 512.

/be
Well, we have 20 bytes of saved registers up front. Then the locals.  then stuff like:

0x000b0712 <_ZN10JSCompiler9statementEv+626>:	call   0xb04a0 <_ZN10JSCompiler9statementEv>
0x000b0717 <_ZN10JSCompiler9statementEv+631>:	mov    %eax,-0x120(%ebp)

and

0x000b074e <_ZN10JSCompiler9statementEv+686>:	je     0xb07c1 <_ZN10JSCompiler9statementEv+801>
0x000b0750 <_ZN10JSCompiler9statementEv+688>:	add    $0x20,%esi
0x000b0753 <_ZN10JSCompiler9statementEv+691>:	mov    %esi,-0x130(%ebp)
0x000b0759 <_ZN10JSCompiler9statementEv+697>:	lea    -0x20(%ebp),%ecx
0x000b075c <_ZN10JSCompiler9statementEv+700>:	mov    %ecx,-0x134(%ebp)

and

0x000b12d2 <_ZN10JSCompiler9statementEv+3634>:	je     0xb134b <_ZN10JSCompiler9statementEv+3755>
0x000b12d4 <_ZN10JSCompiler9statementEv+3636>:	lea    0x20(%esi),%eax
0x000b12d7 <_ZN10JSCompiler9statementEv+3639>:	mov    %eax,-0x140(%ebp)

Those look like register spills to me, right?  That still only gets us up to 320 bytes, though....
Hoping for some help figuring out what provoked this possibly-only-Mac gcc -O3 stack frame size jump.

/be
Here's what a full stack frame looks like:

(gdb) x/128wx 0xbff8fb00
0xbff8fb00:     0xbfffeba8      0xbff8fca4      0x00000001      0xffffffff
0xbff8fb10:     0xbfffeba8      0xbfffeba8      0x0065fbe8      0x000a646e
0xbff8fb20:     0xbfffeba8      0xbfffeba8      0xbff8fba8      0x000a64f3
0xbff8fb30:     0xbfffeba8      0xbfffebd4      0x00000000      0xbfffec18
0xbff8fb40:     0x00000000      0x00000000      0x00610000      0xbfffeba8
0xbff8fb50:     0xbfffeba8      0xbfffeba8      0xbff8fba8      0x000a61a4
0xbff8fb60:     0xbfffeba8      0x0000001d      0x00000000      0xbfffebd4
0xbff8fb70:     0xbfffeba8      0xbfffeba8      0xbff8fbd8      0x000af489
0xbff8fb80:     0xbfffeba8      0xbfffebd4      0x008535c0      0x00853680
0xbff8fb90:     0x0000001e      0xbfffeba8      0xbff8fc68      0x000a699e
0xbff8fba0:     0xbfffeba8      0xbfffeba8      0xbff8fc28      0x000a69aa
0xbff8fbb0:     0xbfffeba8      0xbfffebd4      0xbff902a8      0x00853620
0xbff8fbc0:     0xbfffebd4      0xbfffecb8      0x00000000      0xbfffeba8
0xbff8fbd0:     0xbfffeba8      0xbfffeba8      0x00853770      0x000af874
0xbff8fbe0:     0xbfffeba8      0xbfffeba8      0xbff8fc68      0xbfffebd4
0xbff8fbf0:     0xbfffeba8      0xbfffeba8      0xbff8fc58      0x000af489
0xbff8fc00:     0xbfffebd4      0xbfffebd4      0xbfffebd4      0x008536e0
0xbff8fc10:     0xbfffebd4      0xbff8fc20      0x00000008      0x000ad65e
0xbff8fc20:     0x00002006      0xbfffeba8      0xbff8fca8      0x000ad6a1
0xbff8fc30:     0xbfffebd4      0xbfffebd4      0x00853560      0x008535c0
0xbff8fc40:     0x0080c200      0xbfffecb8      0x00000001      0xbfffeba8
0xbff8fc50:     0xbfffeba8      0xbfffebd4      0xbff8fca8      0x00853620
0xbff8fc60:     0xbfffeba8      0xbfffeba8      0xbff8fcd8      0x0009fa04
0xbff8fc70:     0x00000000      0x008537a0      0x00000000      0xbfffecba
0xbff8fc80:     0x008536e0      0xbfffeba8      0x0000000c      0x00000376
0xbff8fc90:     0xbfffeba8      0x00000000      0xbfffecb8      0x000ad93e
0xbff8fca0:     0xbfffeba8      0x00000002      0x00000001      0xffffffff
0xbff8fcb0:     0xffffffff      0xffffffff      0x00000000      0xbff8fea4
0xbff8fcc0:     0x00000000      0x00000000      0xbfffecb8      0xbfffeba8
0xbff8fcd0:     0xbfffeba8      0x00ffeba8      0x00000030      0xbfffebd4
0xbff8fce0:     0xbfffeba8      0xbfffeba8      0xbff8fd58      0xbfffe8a0
0xbff8fcf0:     0xbfffeba8      0x00853710      0xbff8fef8      0x000b4b4f

That's all 512 bytes of it.  Most of it seems .... redundant.
0xbfffeba8 in this case is |this|.  It almost looks like we're failing to pop the stack after making function calls or something?
An extern "C" vs. C++ mixup? Or just a Mac GCC -O3 bug? Need help reducing.

/be
At least for the sample page I reported this against, it's a regression over FF3.5 - would the reason for that help with isolation? Different compiler version/settings in the build system? Different parser code? Different/Newly introduced stacksize limit?
Florian: we changed to -O3 for 3.6 -- see comment 21, which bisected to find the regressing changeset.

/be
I don't think this blocks, but if it's an easy backport, we'd take it on mozilla-1.9.2
blocking1.9.2: ? → -
problem obviously still present in 

Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.4) Gecko/20100513 Firefox/3.6.4

I've now had to switch our company from Firefox to Safari and Chrome. :-(
blocking2.0: beta1+ → beta2+
blocking2.0: beta2+ → betaN+
Can we get an owner here, please?
This patch hoists the fat cases out of Parser::statement and pushes the declaration of locals to the tightest scope.  On Linux, this shrinks the frame from 224 to 208 bytes.  On OS X 10.5 (with --enable-profiling so I can actually use info frame) the frame shrinks from 640 bytes to 304.  I can confirm the page in comment 0 fails without the patch and loads a bunch of controls with the patch.
Assignee: general → lw
Status: NEW → ASSIGNED
Attachment #488540 - Flags: review?(jwalden+bmo)
(In reply to comment #41)
> I've now had to switch our company from Firefox to Safari and Chrome. :-(

FF 4.0 sounds like a good time to switch back. :-)
Oh, I should also add that ParseMark and SS both say "Meh." on the patch, so no slowdown.
Does the patch really fix the problem, or does it just double the number of "else if"s we can handle?
(In reply to comment #46)
> Does the patch really fix the problem, or does it just double the number of
> "else if"s we can handle?

There is no "real fix"; as comment 27 pointed out, there are any number of ways to hit the stack limit by causing the parser to recur.  This patch just removes a case where we hit the limit way too soon.  While parsing else-if chains could be made to stay inline, comment 27 seems to indicate that that either hard or not possible.
Comment on attachment 488540 [details] [diff] [review]
cut up Parser::statement

This doesn't fix the bug as originally filed, just wallpapers a bit.  This bug is hopelessly overflowing with commentary for anyone looking to tackle it, and now it has "patches" implementing distracting hacks that make some instances of it disappear.  Thus maybe it's best to file a new bug specifically to make if-(else if){n}-else parsing not require O(n) stack space, and just close this one as "kind of sort of fixed for the original use, but really we're just closing because this is a mess".

I considered requesting meaningful names for parse node pointers while you were here, but the review effort was tedious enough as it was without having to handle variable renames at the same time.
Attachment #488540 - Flags: review?(jwalden+bmo) → review+
Comment 48 has good suggestions for followup bugs. The most important one is to parse if-else-if without recursion.

/be
http://hg.mozilla.org/tracemonkey/rev/f048a94e5f27
Whiteboard: fixed-in-tracemonkey
(In reply to comment #48)
> This doesn't fix the bug as originally filed

For better or for worse, it fixes the bug exactly as filed: a regression in the stack frame size of Parser::statement.  Making if/else iterative sounds great; I'm just interested in fixing blockers.
Filed bug 610564.
http://hg.mozilla.org/mozilla-central/rev/f048a94e5f27
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.