Closed Bug 1244280 Opened 4 years ago Closed 2 years ago

Linux javascript call stack greatly limited when using baseline jit

Categories

(Core :: JavaScript Engine, defect)

44 Branch
x86_64
Linux
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla58
Tracking Status
firefox58 --- fixed

People

(Reporter: chrmod, Assigned: arai)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 3 obsolete files)

User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0
Build ID: 20160125133541

Steps to reproduce:

On Linux Javascript Error "too much recursion" happens far more often then on other platforms. 

Simple benchmark:
```
var i = 0;
function recurse() {
  i++;
  return recurse()
} 
try {
  recurse()
} catch(e) {
  console.log(i)
}
```

on Linux tested with js shell options:
* --baseline: gives random number between 5-20k
* --no-baseline: gives precise number of 49999 

on Linux tested with Firefox 44 with pref:
* javascript.options.baselinejit = true: gives random number close to 5k
* javascript.options.baselinejit = false: gives precise number of 50992

Other benchmarks like this: http://jsfiddle.net/josh3736/Z5pLM/ - on Firefox, gives results very similar to js shell, that is stack size is a random number between 5-20k.

Same tests run on MacOS tend to return something around 50k.

V8 on the other hand gives a number around 20k, independently of platform or  environment (node, chromium).



Expected results:

Core problem here is the fact that stack size tend to be random. Everything is OK when the number is close to 20k (so close to V8), but when it hits the lower end of 5k, applications start to crash ("too much recursion" Error). 

As of its random nature this problems tend to be hard to reproduce by application developers and as it happens on Linux it's often neglected. 

I've first noticed this problem around year ago, but may be as old or older than https://bugzilla.mozilla.org/show_bug.cgi?id=966173
Can someone explain what would be consequences of turning off baselinejit in Firefox on Linux?
OS: Unspecified → Linux
(In reply to Krzysztof Jan Modras from comment #1)
> Can someone explain what would be consequences of turning off baselinejit in
> Firefox on Linux?

Simple, if Baseline is disabled, then IonMonkey is disabled too, which implies that all the execution remains in the Interpreter.  So the consequences are executions which are at best 100x slower.

The problem that you are seeing is that Jits (Baseline, IonMonkey) are using the C stack, and the C stack can be limited by the system (ulimit), and by Firefox (reserved safe stack space).

Running in the JS shell:
  … : 86860
  --no-ion: 14471
  --no-baseline: 49999

I guess the browser reduce the stack space available for Baseline, which make this number go down to ~8k.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Thanks for the explanation.

ulimit was my first guess, but system level stack size limit would affect V8 - which reports const stack size of ~20k.

Noticed something interesting about this test on Linux: http://jsfiddle.net/josh3736/Z5pLM/
If you refresh it fast couple times in a row, you will notice the number is getting smaller and smaller. But if you will wait a moment and refresh again it will get bigger again. 
- This behavior may imply that available stack size can be reserved and freed over time.
Hardware: Unspecified → x86_64
Attached patch linux-stack.patch (obsolete) — Splinter Review
It looks like Linux stack size was not fine tuned towards modern desktop systems. Attached patch make Linux builds use same configuration as MacOS ones. 

With this patch applied measured stack size is around 280k - which is obviously far more than needed, but same as on MacOS, which I cannot find a reason to be different.
Flags: needinfo?(past)
Comment on attachment 8849224 [details] [diff] [review]
linux-stack.patch

nbp, do you see any issues with raising the stack size on Linux? It sounds like a leftover from back when the defaults were more conservative in major distros. Our partners at Cliqz have been getting error reports about this in the field.
Flags: needinfo?(past)
Attachment #8849224 - Flags: review?(nicolas.b.pierron)
Can we get the stack limit dynamically with getrlimit or something?
Attached patch linux-stack2.patch (obsolete) — Splinter Review
Attached patch gets current limit from `getrlimit`. Not sure if it requires some edge case handling.
Flags: needinfo?(jdemooij)
Sorry for the delay. Some questions:

* What value do you get for kStackQuota with this patch? 8 MB?

* We shouldn't use the limit we get from getrlimit but subtract something to be safe (on Windows and OS X we also don't use the full stack size). Let's say 128 KB or so.

* Could you add some sanity checks to use the current limits if the value we get is very small and also clamp it to 8 MB if it's larger than that?
Flags: needinfo?(jdemooij) → needinfo?(krzysztof.modras)
(In reply to Krzysztof Jan Modras from comment #4)
> It looks like Linux stack size was not fine tuned towards modern desktop
> systems. Attached patch make Linux builds use same configuration as MacOS
> ones. 

This is actually incorrect, as gcc compiles differently than clang.
Also, the kDefaultStackQuota is supposed to be used for Linux systems.

(In reply to Panos Astithas [:past] (please needinfo?) from comment #5)
> nbp, do you see any issues with raising the stack size on Linux? It sounds
> like a leftover from back when the defaults were more conservative in major
> distros. Our partners at Cliqz have been getting error reports about this in
> the field.

Can you run js/xpconnect/tests/chrome/test_bug732665_meta.js in the JS shell and report the results, then modify these numbers accordingly.

This limit is not supposed to be reached in normal cases, and I do not recall seeing reports telling us to bump this limit again, especially since we have a test[1] to warn us when we have to bump this limit.

Can you detail what they are doing which makes them reach this limit?

[1] js/xpconnect/tests/chrome/test_bug732665.xul
Flags: needinfo?(past)
(In reply to Nicolas B. Pierron [:nbp] from comment #9)
> (In reply to Krzysztof Jan Modras from comment #4)
> > It looks like Linux stack size was not fine tuned towards modern desktop
> > systems. Attached patch make Linux builds use same configuration as MacOS
> > ones. 
> 
> This is actually incorrect, as gcc compiles differently than clang.

The comment was about the stack limit not the stack consumption :) Different compiler versions/flags also use more or less stack space.

> Can you detail what they are doing which makes them reach this limit?

It's not hard to hit the low stack limit on Linux. I think using getrlimit and using a larger stack quota is reasonable.
This is a question for Krzysztof, I don't have useful information to contribute. I'm just trying to help him move this forward here.
Flags: needinfo?(past)
(In reply to Jan de Mooij [:jandem] from comment #10)
> (In reply to Nicolas B. Pierron [:nbp] from comment #9)
> > (In reply to Krzysztof Jan Modras from comment #4)
> > > It looks like Linux stack size was not fine tuned towards modern desktop
> > > systems. Attached patch make Linux builds use same configuration as MacOS
> > > ones. 
> > 
> > This is actually incorrect, as gcc compiles differently than clang.
> 
> The comment was about the stack limit not the stack consumption :)

The comment is fine, except that it might be debatable that "Modern desktop" set this limit and not just the "Linux distributions".

My comment was about the changes in the ifdef, which set different stack quota, which is why I asked to get the measure from the test case.
Comment on attachment 8849224 [details] [diff] [review]
linux-stack.patch

Re-ask for review after replying to comment 9 questions.
Attachment #8849224 - Flags: review?(nicolas.b.pierron)
running `./mach test js/xpconnect/tests/chrome/test_bug732665.xul` towards attachment 8849641 [details] [diff] [review] causes tests to crash with:

>  TEST-INFO | Main app process: killed by out-of-range signal, number 139
>  2 ERROR TEST-UNEXPECTED-FAIL | js/xpconnect/tests/chrome/test_bug732665.xul | application terminated with exit code -139

after reducing the stack size by 128k, test no longer crashes and results with:

>  2 INFO TEST-PASS | js/xpconnect/tests/chrome/test_bug732665.xul | Chrome should be able to have at least 10 heavy frames more stack than content: 1542, 1507 
>  3 INFO TEST-PASS | js/xpconnect/tests/chrome/test_bug732665.xul | Chrome should be invokable from content script with an exhausted stack: 36 
>  GECKO(8266) | MEMORY STAT vsizeMaxContiguous not supported in this build configuration.
>  GECKO(8266) | MEMORY STAT | vsize 1971MB | residentFast 339MB | heapAllocated 154MB
>  4 INFO TEST-OK | js/xpconnect/tests/chrome/test_bug732665.xul | took 1119ms

I've also tested `./run-mozilla.sh ./xpcshell test.js` with given `test.js` script:

> var i = 0; function r() { i++; r() }; try { r() } catch(e) { dump(i+'\n') }

with attachment 8849641 [details] [diff] [review] it crashed with segmentation fault, with reduced stack size it prints number around 320k. Default Linux build result with value around 28k.
Flags: needinfo?(krzysztof.modras) → needinfo?(nicolas.b.pierron)
As of the use case that lead me to discovering the problem. I'm working at cliqz.com - where we build Firefox and Firefox addon that replaced default urlbar dropdown. We were reaching stack limits on our standard user interaction that is rendering search results in the dropdown. Our javascript codebase is relatively big and does more all less the following: 
* for every keystroke in urlbar input field it fetches search results from backend, mix and deduplicate them with history results
* pass the results through autocomplete provider (replacement of unifiedcomplete) 
* enhance some of the results with additional local business logic
* render the templates with handlebars

We had to cut that stack in two places using setTimeout(... , 0) in two places to finally have version that works on Linux in most cases. 

We've seen situation that solely handlerbars rendering (templates + helpers) generate a stack higher that 4k. 

I believe that sort of problems are experienced in the wild but are rarely identified as "too much recursion" error  - as this error can happen on any function call it is likely to be swallowed by application / library / framework code and may result in something as trivial as not responding button that works after a second click. 

What is your opinion about it?
Attached patch linux-stack3.patch (obsolete) — Splinter Review
I've updated the patch to reserve some stack size (128k) - this makes the test to pass with given results:
> 2 INFO TEST-PASS | js/xpconnect/tests/chrome/test_bug732665.xul | Chrome should be able to have at least 10 heavy frames more stack than content: 1542, 1508
> 3 INFO TEST-PASS | js/xpconnect/tests/chrome/test_bug732665.xul | Chrome should be invokable from content script with an exhausted stack: 35
> GECKO(11058) | MEMORY STAT vsizeMaxContiguous not supported in this build configuration.
> GECKO(11058) | MEMORY STAT | vsize 1958MB | residentFast 338MB | heapAllocated 138MB
Attachment #8849224 - Attachment is obsolete: true
Attachment #8849641 - Attachment is obsolete: true
Comment on attachment 8860625 [details] [diff] [review]
linux-stack3.patch

Review of attachment 8860625 [details] [diff] [review]:
-----------------------------------------------------------------

A part from the following comment, this looks good.

::: js/xpconnect/src/XPCJSContext.cpp
@@ +3477,5 @@
>      const size_t kTrustedScriptBuffer = sizeof(size_t) * 12800;
> +#elif defined(XP_LINUX)
> +    // Linux distributions set default stack size to 8MB
> +    struct rlimit rlim;
> +    getrlimit(RLIMIT_STACK, &rlim);

Check the error code before changing the stack values with potentially uninitialized values.
(In reply to Krzysztof Jan Modras from comment #15)
> We had to cut that stack in two places using setTimeout(... , 0) in two
> places to finally have version that works on Linux in most cases. 

Have you look at Promise?  This is a good way to get rid of the callback hell, and this reduce the stack depth too.

> We've seen situation that solely handlerbars rendering (templates + helpers)
> generate a stack higher that 4k. 
> 
> […] 
> 
> What is your opinion about it?

Did you identify which functions were on the stack while these error happened?

Isn't that reported by the dev-tool?  I guess that is the kind of things that we should always report to the user, even if it is caught by the code, as this would certainly cause different behaviour as you described.
Flags: needinfo?(nicolas.b.pierron)
See Also: → 1412773
What can we do to push this one forward? 

IMO the stack limit on Linux is far too low and I don't see a reason why that platform should have lower settings than Windows or OSX. If dynamic detection of stack size is not satisfactory, then lets at least set the default number to same value as for Windows or OSX. We can work on a better solution over time. 

(Have just noticed this issue is opened for two years - we could have anything better so long time ago..)
Flags: needinfo?(jdemooij)
Landing something like the patch here makes sense to me. Forwarding to arai since this is so similar to bug 1412773.
Flags: needinfo?(jdemooij) → needinfo?(arai.unmht)
checked linux-stack3.patch on m-c rev aa958b29c149, with bug 1412773 test case,
and surely it fixes the "too much recursion" issue.
I agree that we need to land this patch.

Krzysztof, are you going to fix the patch? (error handling, minimum value, etc, mentioned above)
or, can I take this over?
Flags: needinfo?(arai.unmht) → needinfo?(krzysztof.modras)
also, confirmed the limit that getrlimit returns is 8MB for me too, on ubuntu (64bit).
Jan, Arai, thank you for your responses.

Arai, I would appreciate if you could take it over - I'm not much of C++ dev and it would take me a while to recreate the setup. But if you rather prefer, me taking care of it, please, by any means let me know.
Flags: needinfo?(krzysztof.modras) → needinfo?(arai.unmht)
Okay, thanks!
Taking over.
Assignee: nobody → arai.unmht
Status: NEW → ASSIGNED
Flags: needinfo?(arai.unmht)
(now I'm cleaning up other stack size calculation)

IMO, we should look into android case as well.
we could use getrlimit there, instead of static value.
Now Linux (except Android) uses getrlimit+RLIMIT_STACK to retrieve stack quota.  The value is restricted to [1MB, 8MB-128kB] range (128kB is the safe margin), and the safe margin is also applied to the result of getrlimit+RLIMIT_STACK, so that there's always 128kB margin as long as there's 1.1MB actual stack space.

Linux ASan case is embedded into the Linux case's branch, since the value form getrlimit can be greater than default ASan case's quota.

Linux's kTrustedScriptBuffer is bumped to 180kB from 128kB, since most distro uses 8MB.
(I could add some branches to use original 128kB/64kB value for some case, but I don't think it worth)

The last 2 branches for opt/debug are now used by other configurations, maybe other UNIX.  We could detect the presence of getrlimit in configure script and use it, but it's for other bug.
That said, we could use getrlimit also on macOS and Android, but I'd leave them to other bugs, to make it easier to backout on trouble.  getrlimit returns 8.3MB on macOS, that matches to the current hardcoded value, but it returns 8MB on Android emulator for me, that doesn't match to current hardcoded value (0.75MB). not sure if it's emulator specific thing...

After the patch, pre-existing standalone ASan case is used only by Windows/Android/other cases,  and I think the quota value seems to be wrong for Windows/Android cases, since the value is be greater than the almost-maximum value used on non-ASan case.
I'll file a bug for it later.

Also, removed stale and actually unrelated comment from js/rust/src/rust.rs.
Attachment #8860625 - Attachment is obsolete: true
Attachment #8925244 - Flags: review?(nicolas.b.pierron)
Comment on attachment 8925244 [details] [diff] [review]
Use system stack limitation instead of hardcoded smaller value on linux.

Review of attachment 8925244 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/rust/src/rust.rs
@@ -36,5 @@
>  
>  // From Gecko:
> -// The JS engine permits us to set different stack limits for system code,
> -// trusted script, and untrusted script. We have tests that ensure that
> -// we can always execute 10 "heavy" (eval+with) stack frames deeper in

nit: Add a reference to the XPCJSContext.cpp file.

::: js/xpconnect/src/XPCJSContext.cpp
@@ +1100,5 @@
>      // ASAN builds have roughly thrice the stack overhead as normal builds.
>      // On normal builds, the largest stack frame size we might encounter is
>      // 9.0k (see above), so let's use a buffer of 9.0 * 5 * 10 = 450k.
> +    //
> +    // FIXME: Does this branch make sense for Windows and Android?

follow-up: I do not think it matters yet, so this branch seems to be technically dead.  I guess we should remove it in a follow-up or move it under ANDROID / XP_WIN guards.  It might be good to ask a build-system peer as well, to know if these are valid configuration that we are supporting, or intent to support.
Attachment #8925244 - Flags: review?(nicolas.b.pierron) → review+
Blocks: 1415195
Pushed by arai_a@mac.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/04d539bc9bca
Use system stack limitation instead of hardcoded smaller value on linux. r=nbp
https://hg.mozilla.org/integration/mozilla-inbound/rev/04d539bc9bca98f22ca51de00a40eceb516405e3
Bug 1244280 - Use system stack limitation instead of hardcoded smaller value on linux. r=nbp
https://hg.mozilla.org/mozilla-central/rev/04d539bc9bca
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Duplicate of this bug: 1412773
Duplicate of this bug: 966173
Duplicate of this bug: 1006134
You need to log in before you can comment on or make changes to this bug.