Closed Bug 644241 Opened 13 years ago Closed 13 years ago

Remove script stack quota

Categories

(Core :: JavaScript Engine, defect)

Other Branch
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: azakai, Assigned: azakai)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(2 files, 3 obsolete files)

Bug 502836 raised the script stack space quota significantly, and made it depend on available memory. However, Web Workers are apparently still quite limited - for example I have a 7MB script that runs fine in a <script> element in a web page, but hits "script stack space quota is exhausted" when a Web Worker tries to load it using importScripts().

Looking in the code, there are various places where JS_SetScriptStackQuota() is called, with different values. What are the considerations here? Do we purposefully want to limit certain contexts? Or should we change them all to a value that allows large JS applications?
Blocks: 644244
From what I understand by reading the code, the script stack space quota is an optional limitation. If a NULL pointer is given when calling JS_InitArenaPool (as the quotap parameter), then no quota checking is done at all.

What would be the downside to simply not enforcing a script stack quota at all in Firefox? (I've been inconvenienced by this limitation a few times so far, and I am not sure I see the upside.) Is it meant to guard against malicious or poorly written scripts using too much system memory?
igor, I'm not really sure who to ask about this for feedback. Asking you since you reviewed that other patch with script stack quota.

The patch makes it so -1(unsigned) means no script space quota is enforced, and implements that in Firefox.

I want to suggest doing that, to avoid the current problem in this bug with web workers, and potential future problems as well. I do not see the usefulness of enforcing a quota. I guess it can protect us from a web page DOSing us, but we do allow 100MB for each context right now, so a script can just open several workers and DOS us that way with multiples of 100MB. (And, we can be easily DOSed by loading too many images anyhow...)
Attachment #528206 - Flags: feedback?(igor)
To disable quota checks the patch should set pool->quotap to null.
Are quotas used for any other pools?  Ie. can the quota feature be removed as part of this change?
(In reply to comment #1)
> Is it meant to guard against malicious or
> poorly written scripts using too much system memory?

Without the quota an almost trivial mistakes in scripts would lead to a browser crashing via swapping and OOM instead of presenting an error message. So the issue is about web-developer friendliness.

What quota is necessary for the test to pass? Also I suspect the bug is not in the quota enforcement but rather in bad scheduling of the GC for web workers.
(In reply to comment #5)
> (In reply to comment #1)
> > Is it meant to guard against malicious or
> > poorly written scripts using too much system memory?
> 
> Without the quota an almost trivial mistakes in scripts would lead to a browser
> crashing via swapping and OOM instead of presenting an error message. So the
> issue is about web-developer friendliness.
> 
> What quota is necessary for the test to pass? Also I suspect the bug is not in
> the quota enforcement but rather in bad scheduling of the GC for web workers.

Well, the problem is finding the right quota. The last time around (bug 502836) the fix was to use 1/4 of total system RAM. But that's just a heuristic and it is surely wrong in many cases (if lots of system memory is used by other apps, or system memory changes, etc.). It does help in the cases we have encountered so far though, and it would fix the case that led me to file this bug.

So we can do the same for web workers as well I guess. Would that be a better solution here? I do agree that preventing OOM due to a script error is a good thing to have. (I just wish there was a better solution here, so we would never have to waste time on script stack quotas again.)
(In reply to comment #4)
> Are quotas used for any other pools?  Ie. can the quota feature be removed as
> part of this change?

Yes, the quoata can be removed but only after we find out the reason it does not work for web workers.
(In reply to comment #7)
> (In reply to comment #4)
> > Are quotas used for any other pools?  Ie. can the quota feature be removed as
> > part of this change?
> 
> Yes, the quoata can be removed but only after we find out the reason it does
> not work for web workers.

The technical reason for the issue is that we set the script stack quota to a fixed 100MB for web workers, and that isn't enough for big scripts. (But perhaps there is a nontechnical reason behind it, to purposefully limit web workers? I hope not.)

There are 5 places in the code where we set the script stack size: 1 for the js shell, 1 for the normal web page context, 1 for dom threads (web workers), and 2 for tab context (in and out of process).

I guess we should set the stack quota to the same large value we use since bug 502836 (1/4 of physical memory) for the normal web context, for all of these? I'm not sure where to put a single function that returns that size though. (What appropriate code location can be accessed by both the js shell, xpconnect, and gecko code?)
Due to recent changes the quota is no longer an effective defense against simple bugs as it does not limit the amount of script stack a script can consume. These days it mostly restis the amount of memory the parser can use. So I am in favor of removing the quota altogether. But we would need a separated bug about our parser/compiler taking too much memory when compiling large scripts.
Ok, should I write a patch to remove the quota handling altogether, or just disable it being used in Firefox?
(In reply to comment #9)
> But we would need a separated bug about our parser/compiler taking too 
> much memory when compiling large scripts.

Bug 626932 covers most of the parsing side.
(In reply to comment #10)
> Ok, should I write a patch to remove the quota handling altogether, or just
> disable it being used in Firefox?

Lets remove the code completely. The quota does not cover the stack space consumed by a script for quite some time so it does not make sense to keep API that has not been working.
So, something like this patch?

I am a little concerned though, while perhaps the quota was not perfect, it was protecting against some form of DOS, and this patch will make us vulnerable to that.

How about adding a check in the JS engine malloc() wrapper, to make sure it never allocates more than a total of (for example) 1/2 of physical memory?
Attachment #528206 - Attachment is obsolete: true
Attachment #528206 - Flags: feedback?(igor)
Attachment #528502 - Flags: feedback?(igor)
(In reply to comment #13)
> So, something like this patch?

Yes!

> 
> I am a little concerned though, while perhaps the quota was not perfect, it was
> protecting against some form of DOS, and this patch will make us vulnerable to
> that.

If one wants to DOS the browser over out-of-memory, he does not need to do in JS anything beyond:

var a = []; while (true) a.push(Array(1e7).join(''))

This results in creation of allocation of 20MB strings that the GC cannot collect. But it is rather hard to come up with such code deliberately.
Attachment #528502 - Flags: feedback?(igor) → feedback+
Assignee: general → azakai
Summary: "script stack space quota is exhausted" with Web Workers → Remove stack quota
Summary: Remove stack quota → Remove script stack quota
Attached patch patch without whitespace (obsolete) — Splinter Review
Ok, here is a patch that should work (patch ignores whitespace, to be easier to read, there are some indentation changes).

The problem (as always with scriptStackQuota) are the tests:

regress-338121-01.js, -02 and -03 fail with |No test results reported. (SCRIPT)| or with a timeout. The failure is because the call to |new Function(| aborts the script silently (without a catchable error), I presume due to OOM. The timeout happens instead sometimes on tryserver, I suspect also due to OOM.

regress-338001.js works for me locally but times out on tryserver, I suspect due to OOM.

The timeouts were presumably prevented earlier by the script stack quota - so there is in fact a step back with this approach. Perhaps it makes sense to do something like what I mentioned at the bottom of comment #13 (with a nice API, of course)?

Otherwise, I am not sure how to proceed at this point.
Attachment #528502 - Attachment is obsolete: true
(In reply to comment #15)
> regress-338121-01.js, -02 and -03 fail with |No test results reported.
> (SCRIPT)| or with a timeout. The failure is because the call to |new Function(|
> aborts the script silently (without a catchable error), I presume due to OOM.

The tests have to be changed to expect OOM instead of the error that can be caught. Add expectExitCode(0);  expectExitCode(5);  at the test start, remove try-catch and annotate them in js1_5/extensions/jstests.list as slow.

> regress-338001.js works for me locally but times out on tryserver, I suspect
> due to OOM.

The same here.

> The timeouts were presumably prevented earlier by the script stack quota - so
> there is in fact a step back with this approach.

The step back happened when we started to ignore the quota when dealing with the script stack space. Supporting quota in the area where it is hard to make bugs that would lead to OOM is waste of efforts.

> Otherwise, I am not sure how to proceed at this point.

Update the above tests, look at other tests that check for quota exceptions (grep for "InternalError: script stack space quota is exhausted") and update them is necessary. Then ask for a review.
I made those changes and pushed to try,

http://tbpl.mozilla.org/?tree=Try&rev=ce439e913727

but I still get the same problems.
Hmm, wait, I pushed an earlier version of the patch by mistake. Ignore the previous comment.
Didn't make a difference, still getting the same errors and timeouts,

http://tbpl.mozilla.org/?tree=Try&rev=34d703ca2014
(In reply to comment #19)
> Didn't make a difference, still getting the same errors and timeouts,

The patch may not mark all the tests that tries to hit the quota limit. When you run the tests locally, does it take more time with the patch to go through all the tests?
I have a problem running them locally, if I do so naively, then it locks up my machine while swapping and I usually need to reboot (a problem with my laptop or ubuntu I guess). So I run locally with ulimit, to not allow it to allocate too much memory. But in that case, there are no timeouts (just quick failures).

But, we can see which tests time out, so we can tell which are problematic, I think? The ones that time out, are marked. Or is it possible for one test to allocate memory in a way that affects a later test somehow, so a non-time outing test could be partially responsible too?
(In reply to comment #21)
> I have a problem running them locally, if I do so naively, then it locks up my
> machine while swapping and I usually need to reboot (a problem with my laptop
> or ubuntu I guess). So I run locally with ulimit, to not allow it to allocate
> too much memory. But in that case, there are no timeouts (just quick failures).

All tests that can too much memory should be marked as slow in jstests.list. Those are not run by default. So do you observe any extra failures with the patch when you run with ulimit? Also I would suggest to run with the the -t option like -t 20 to limit the time the test can run and then observe regressions with the patch.

> Or is it possible for one test to
> allocate memory in a way that affects a later test somehow, so a non-time
> outing test could be partially responsible too?

Each test is run as a separated process. So memory usage in one should not affect another.
(In reply to comment #22)
> (In reply to comment #21)
> > I have a problem running them locally, if I do so naively, then it locks up my
> > machine while swapping and I usually need to reboot (a problem with my laptop
> > or ubuntu I guess). So I run locally with ulimit, to not allow it to allocate
> > too much memory. But in that case, there are no timeouts (just quick failures).
> 
> All tests that can too much memory should be marked as slow in jstests.list.
> Those are not run by default.

Very odd, I do see tests I marked as slow being run, both locally with

make -C ../.. jstestbrowser

and on tryserver, see link to try run above (for example, http://hg.mozilla.org/try/file/34d703ca2014/js/src/tests/js1_5/Function/jstests.list ).
(In reply to comment #23)
> Very odd, I do see tests I marked as slow being run, both locally .. and on tryserver

That explains the failures.

To Paul: do you know why we run the slow tests on the tryserver and with make jstestbrowser? Those that surely hit OOM makes the tryserver to timeout.
(In reply to comment #24) 
> To Paul: do you know why we run the slow tests on the tryserver and with make
> jstestbrowser? Those that surely hit OOM makes the tryserver to timeout.

No, I've only looked at the jstests harness, not the reftest one used by jstestbrowser.
I think we need to either disable tests that can OOM, or add some code that prevents OOMs (the script stack quota, although flawed, did fulfill that purpose).

Disabling the tests is obviously simpler, but we lose coverage, and we allow users to be vulnerable to more DOSes.

Preventing OOMs is possible, I have a very hacky wip patch that, together with this patch to remove the script stack quota, is green on try,

http://tbpl.mozilla.org/?tree=Try&rev=080e72d0355a
Remove script stack quota: http://hg.mozilla.org/try/rev/0e9feb649f52
Prevent OOMs: http://hg.mozilla.org/try/rev/080e72d0355a (don't take the code seriously, it's just a quick hack to see what can work)

The risk with preventing OOMs is that it might have false positives in rare cases (that is, malloc returns NULL even though it can and should allocate memory). The wip patch does work so far though. It

1. Adds a global callback function that the JS engine calls every time X bytes have been meanwhile allocated, and the callback can decide whether to allow the allocation or not.

2. In the global callback, it checks how much resident memory is used by the current process, and if that exceeds 90% of physical memory, does not allow it. (I tried using the about:memory counters too, which are in theory more precise, but it turns out they are very inaccurate.) So, the false positive here could be, a case in which the user *does* want Firefox to take over 90% (or whatever figure we decide) of physical memory. This seems highly unlikely, but possible.


Do we want to work on such an approach? Or should we just disable the OOMing tests and accept the potential DOS risks?
(In reply to comment #26)
> (I tried using the about:memory counters too, which are in theory more
> precise, but it turns out they are very inaccurate.)

Can you explain this some more?  Thanks!
(In reply to comment #27)
> (In reply to comment #26)
> > (I tried using the about:memory counters too, which are in theory more
> > precise, but it turns out they are very inaccurate.)
> 
> Can you explain this some more?  Thanks!

Building with the two patches mentioned above, and running the js reftests, at the end of them I get XPConnectJSStringData reporting 4GB is being used, while the resident memory is just 90MB.

Want me to file a bug for this? I can probably figure out which test causes it.

Regardless, for this bug, I'd rather use the resident memory stat, which the OS guarantees to be valid. Otherwise we could get very hard to understand bugs (as in, web pages not working right, the UI breaking, who knows) from any overestimation in the about:memory counters.
azakai, I'm about to land several patches to fix inaccuracies in about:memory, including bug 648490 which is about XPConnectJSStringData :)

The "resident" and "mapped/heap/used" counters should definitely be accurate, because we get them from the OS and the malloc implementation respectively.  The ones under those (from within Firefox) may have inaccuracies, though I've hopefully caught most of them by now.
Yep, bug 648490 looks like what I was seeing... :)
Igor, ping about comment 26? I need some guidance as to how to proceed here.
(In reply to comment #31)
> Igor, ping about comment 26? I need some guidance as to how to proceed here.

OOM preventing is for another bug. For now lets just disable oom tests that this patch affects.
Blocks: 657444
(In reply to comment #32)
> (In reply to comment #31)
> > Igor, ping about comment 26? I need some guidance as to how to proceed here.
> 
> OOM preventing is for another bug. For now lets just disable oom tests that
> this patch affects.

Sounds good. Filed 657444 for preventing OOMs.

Here is the patch.
Attachment #532796 - Flags: review?(igor)
This patch disables OOMing tests.

Combined with the first patch it looks good on try so far, http://tbpl.mozilla.org/?tree=Try&rev=ee75dcc95b14
Attachment #528914 - Attachment is obsolete: true
Attachment #532797 - Flags: review?(igor)
Attachment #532797 - Flags: review?(igor) → review+
Igor, review ping?
Comment on attachment 532796 [details] [diff] [review]
Remove script stack quota

Review of attachment 532796 [details] [diff] [review]:
-----------------------------------------------------------------

I forgot to r+ this patch after giving it to the part that disables the tests.
Attachment #532796 - Flags: review?(igor) → review+
Tiny followup for compiler warning:

http://hg.mozilla.org/tracemonkey/rev/ea71b04c158b
Documentation has been updated, despite this not having been tagged as requiring it. :)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: