1755864 - Process using 100% of the CPU on the main thread recording infinite recursion error stacks

Reporter

Description

•

3 years ago

When I came back to my laptop after the night, the fans were noisy. In about:processes I found that one content process for google.com was using 100% of a core on its main thread.

I captured a profile of this process: https://share.firefox.dev/3HZ3rkr

I also have a longer marker-only profile, as I kept the profiler running all night to chase a graphics bug: https://share.firefox.dev/3rWaDZ2

From the profile, it seems we spend most of the time reporting the excessive recursion (js::ReportOverRecursed), and allocating memory to capture and report the massive JS stack (js::ErrorObject::getStack, js::CaptureStack) of the error.

I wonder if we could have stopped the script execution entirely at some point. I think that tab was in the background.

According to the Jank marker in the profile, this had been going on for 4502s.

Nicolas B. Pierron [:nbp]

Comment 1

•

3 years ago

This bug is an idea of a work-around to prevent pages from blocking web pages while they are catching all errors, such as recursion errors, and constantly retry while not changing any parameters.

I think this is a good idea to prevent these kind of behavior, the question is whether this is something which we could apply without breaking the spec?

One idea might be to slowly decrease the available stack space as we repeatedly run into recursion errors, within a single execution.
However, doing so might yield so other errors such as how do we handle cases where a recursion error has to be raised while handling the error.
Another issue would be that the stack limit is shared across all realms, which would change behavior of code running within iframes.

One other idea, would be to have a counter on the catch statement, and limit the number of recursion errors per catch statement. Thus forwarding the recursion error if we keep hitting the same.

Yulia, Jan, do you have any idea if this is ""spec""-compliant and doable?

Severity: -- → S4

Flags: needinfo?(ystartsev)

Flags: needinfo?(jdemooij)

Priority: -- → P3

Nicolas B. Pierron [:nbp]

Updated

•

3 years ago

Blocks: sm-runtime

Iain Ireland [:iain]

Comment 2

•

3 years ago

It sounds like a lot of the performance impact is caused by the stack capture, so an alternative would be to decrease the maximum number of frames we capture here when ReportOverrecursed is called too much.

Jan de Mooij [:jandem]

Comment 3

•

3 years ago

(In reply to Nicolas B. Pierron [:nbp] from comment #1)

This bug is an idea of a work-around to prevent pages from blocking web pages while they are catching all errors, such as recursion errors, and constantly retry while not changing any parameters.

I think this is a good idea to prevent these kind of behavior, the question is whether this is something which we could apply without breaking the spec?

My gut feeling is that changing the stack limit gradually or 'messing' with the script in other ways might cause more issues than it solves, because there are legitimate reasons for reaching the stack limit. Maybe we could improve the slow-script mechanism to handle this better if needed.

(In reply to Iain Ireland [:iain] from comment #2)

It sounds like a lot of the performance impact is caused by the stack capture, so an alternative would be to decrease the maximum number of frames we capture here when ReportOverrecursed is called too much.

Changing that MAX_REPORTED_STACK_DEPTH constant to a function based on the number of stack captures we did for exceptions in that realm until this point could work. That said, I don't know if it would help here: if the website is stuck in a loop throwing and catching overrecursion exceptions, we'd just burn the cycles elsewhere.

Flags: needinfo?(jdemooij)

Yulia Startsev [:yulia]

Comment 4

•

3 years ago

In terms of the spec: I think that we won't be able to do something spec wise. However, if we know for sure that we are dealing with error behavior, we might be able to do something outside of the spec and still be compliant.

On the other hand, I know that react uses errors as a way to implement continuations (i don't know if it is still up to date: https://reactjs.org/docs/concurrent-mode-suspense.html and a simplified example: https://gist.github.com/sebmarkbage/2c7acb6210266045050632ea611aebee). I think if we limit or try to block error recursion, we might break anything that relies on that kind of behavior?

Flags: needinfo?(ystartsev)

Florian Quèze [:florian]

Reporter

Comment 5

•

2 years ago

I reproduced this again today. It's probably due to a bug on a google's side (today it happened in a content process that had only 2 news.google.com tabs), but I still think we should do something about it to avoid using so much CPU when this happens.

Today's profile: https://share.firefox.dev/3PDg3Bl with an impressive 77704s (more than 21h!) jank marker.

Performance Impact: --- → ?

Florian Quèze [:florian]

Reporter

Updated

•

2 years ago

Keywords: power

Bas Schouten (:bas.schouten)

Comment 7

•

2 years ago

Do we know if this reproduces in Chrome?

Flags: needinfo?(florian)

Florian Quèze [:florian]

Reporter

Comment 8

•

2 years ago

(In reply to Bas Schouten (:bas.schouten) from comment #7)

Do we know if this reproduces in Chrome?

I don't know.

Flags: needinfo?(florian)

Markus Stange [:mstange]

Comment 9

•

2 years ago

I'm going to set this to high even though we don't really have an idea of how common this is. I've specified "Websites affected: Major" because it's a google site, even though it might only happen rarely on this site. Not sure if this is the best scoring.

The Performance Impact Calculator has determined this bug's performance impact to be high. If you'd like to request re-triage, you can reset the Performance Impact flag to "?" or needinfo the triage sheriff.

Platforms: macOS
Impact on site: Renders site effectively unusable
Websites affected: Major
[x] Causes severe resource usage

Performance Impact: ? → high

Markus Stange [:mstange]

Comment 10

•

2 years ago

I'm confused why this doesn't hit some kind of "long-running script" limit. Can't we just terminate runaway scripts like these?

Benjamin De Kosnik [:bdekoz]

Comment 11

•

2 years ago

Curious as to why the background tab is reporting recursion errors but not long script errors? Equal Play for All Errors!

robobenklein

Comment 12

•

2 years ago

I used to have this error as described, however today I notice only the memory usage without the CPU time.

FF 104.0.2 on Linux, https://www.recaptcha.net subframe using 11GB of memory, but nearly zero cpu.

Florian Quèze [:florian]

Reporter

Comment 13

•

2 years ago

(In reply to Benjamin De Kosnik [:bdekoz] from comment #11)

Curious as to why the background tab is reporting recursion errors but not long script errors?

I think it's because in bug 1694229 we changed the slow script warning so that it only appears when there has been user interaction with the tab.

The profiles in this bug show "JS::InterruptCallback" markers, so JS engine offers us the opportunity to stop the long running script, we just don't take it.

BugBot [:suhaib / :marco/ :calixte]

Comment 14

•

2 years ago

The severity field for this bug is set to S4. However, the Performance Impact field flags this bug as having a high impact on the performance.
:sdetar, could you consider increasing the severity of this performance-impacting bug? Alternatively, if you think the performance impact is lower than previously assessed, could you request a re-triage from the performance team by setting the Performance Impact flag to ??

For more information, please visit auto_nag documentation.

Flags: needinfo?(sdetar)

Steven DeTar [:sdetar]

Updated

•

2 years ago

Flags: needinfo?(sdetar)

Iain Ireland [:iain]

Updated

•

2 years ago

Performance Impact: high → ?

Iain Ireland [:iain]

Comment 15

•

2 years ago

Based on conversations at the perf team work week, I think this would not be marked as high impact according to the current standards. Requesting retriage.

Greg Mierzwinski [:sparky]

Comment 16

•

1 years ago

:florian, can you provide a reduced test case for this issue so that we can reproduce it?

Flags: needinfo?(florian)

Florian Quèze [:florian]

Reporter

Comment 17

•

1 years ago

Attached file Test case — Details

The attached testcase will reproduce the behavior of permanently using 100% of a CPU core while in the background. Just load the testcase in a tab and then select another tab.

Profile of the testcase: https://share.firefox.dev/45NC8G8

Flags: needinfo?(florian)

Florian Quèze [:florian]

Reporter

Comment 18

•

1 years ago

(In reply to Iain Ireland [:iain] from comment #15)

Based on conversations at the perf team work week, I think this would not be marked as high impact according to the current standards. Requesting retriage.

Reproducible on all platforms, causes severe resource usage, multiple reporters, the affected websites were major (google.com), this still scores as high.

Performance Impact: ? → high

Iain Ireland [:iain]

Comment 19

•

1 years ago

Hmm. I think the performance calculator should distinguish between "this consistently affects a major site" and "this affected a major site once when there was buggy code on it", but sure.

Jan and I discussed this. Our ability to fix it inside SpiderMonkey is limited. We can make throwing repeated errors cheaper (by, for example, capturing smaller stacks), but as Jan pointed out in comment 3, if we're stuck in an infinite loop, then it doesn't really matter what we're doing while we spin in circles: we're still going to be using 100% of the CPU. We don't want to play around with no longer throwing errors at all, because it runs the risk of correctness/webcompat issues.

Instead, this should probably be fixed in the browser, where there's more context. As mentioned in comment #13, SM is giving the browser the option to terminate the script, but we're not taking advantage of that opportunity.

I'm going to redirect this bug to XPCOM to match bug 1694229. If it helps, it would be easy for the JS engine to keep track of how many times we've called ReportOverRecursed, and make that number available to Gecko.

Component: JavaScript Engine → XPCOM

Keywords: regression

Regressed by: 1694229

BugBot [:suhaib / :marco/ :calixte]

Comment 20

•

1 years ago

Set release status flags based on info from the regressing bug 1694229

:dthayer, since you are the author of the regressor, bug 1694229, could you take a look?

For more information, please visit BugBot documentation.

status-firefox114: --- → affected

status-firefox115: --- → affected

status-firefox116: --- → affected

status-firefox-esr102: --- → affected

Flags: needinfo?(dothayer)

Alex Thayer [:alexical] (she/her)

Comment 21

•

1 years ago

I don't understand based on the scrollback here how this is any different than any other tab which is caught in an infinite loop? The difference between us and chrome here seems to be purely in the fact that the window title changes cause our memory usage to explode but not chrome's? Am I misunderstanding something? If we want to undo the decision in bug 1694229, that's fine I guess, but it seems to me that it is working as intended.

Flags: needinfo?(dothayer)

Pascal Chevrel:pascalc

Updated

•

1 years ago

status-firefox114: affected → wontfix

status-firefox-esr102: affected → wontfix

Donal Meehan [:dmeehan]

Updated

•

1 years ago

status-firefox115: affected → wontfix

Florian Quèze [:florian]

Reporter

Updated

•

1 year ago

Updated

•

1 year ago

status-firefox116: affected → wontfix

Frank Doty [:fdoty]

Comment 22

•

8 months ago

:nika can you review this bug and determine if the severity matches the performance impact? A High impact would translate to a S2. If you feel this is an S3, can you comment on the mismatch between our calculations and yours and we can reassess the impact?

Flags: needinfo?(nika)

Nika Layzell [:nika] (ni? for response)

Comment 23

•

8 months ago

I'm not super familiar with the situation around slow script warnings and this. It seems like perhaps we could do something more here on the Gecko side as per comment 19 (perhaps making the decision to interrupt background JS even if the user isn't interacting after a certain point?), but that seems like a decision to change the approach we took in bug 1694229.

Redirecting to :alexical who has more insight here, and can help make a call.

Flags: needinfo?(nika) → needinfo?(dothayer)

Alex Thayer [:alexical] (she/her)

Comment 24

•

8 months ago

Let's say:

I'm a browser user using a web tool which does a bunch of processing in JS and can use up a bunch of CPU and memory to do its job
I regularly use this tool
Since I know it takes a long time, I initiate the workload and then tab out to do something else. I'll check in in a few minutes to see the result

If we just silently interrupt it, we break their workflow. Last I checked, Chrome does not do this. If we alert the user, we annoy them. Chrome also does not do this. I do not see a compelling reason to deviate from Chrome's behavior here.

There might be some smarter things we could do, but in either case I don't see a compelling reason to keep this in the Performance Impact: high bucket.

Performance Impact: high → low

Flags: needinfo?(dothayer)