Closed Bug 1527839 Opened 2 years ago Closed 7 months ago

Crash [@ js::jit::ExtractLinearSum] with stack space exhaustion

Categories

(Core :: JavaScript Engine, defect, P5)

x86_64
Linux
defect

Tracking

()

RESOLVED FIXED
81 Branch
Tracking Status
firefox-esr68 --- wontfix
firefox-esr78 --- wontfix
firefox67 --- wontfix
firefox68 --- wontfix
firefox79 --- wontfix
firefox80 --- wontfix
firefox81 --- fixed

People

(Reporter: decoder, Assigned: kanishk509, Mentored)

References

Details

(4 keywords, Whiteboard: [bugmon:confirmed])

Crash Data

Attachments

(3 files)

The following testcase crashes on mozilla-central revision 2bf86657a448 (build with --enable-posix-nspr-emulation --enable-valgrind --enable-gczeal --disable-tests --disable-profiling --disable-debug --enable-optimize, run with --fuzzing-safe --cpu-count=2 --ion-offthread-compile=off --ion-gvn=off):

See attachment.

Backtrace:

received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff5d01700 (LWP 19042)]
0x0000555555eb1fa2 in js::jit::ExtractLinearSum (ins=ins@entry=0x7ffff5f7b218, space=space@entry=js::jit::MathSpace::Modulo) at js/src/jit/IonAnalysis.cpp:3428
#0  0x0000555555eb1fa2 in js::jit::ExtractLinearSum (ins=ins@entry=0x7ffff5f7b218, space=space@entry=js::jit::MathSpace::Modulo) at js/src/jit/IonAnalysis.cpp:3428
#1  0x0000555555eb204c in js::jit::ExtractLinearSum (ins=ins@entry=0x7ffff5f7b328, space=space@entry=js::jit::MathSpace::Modulo) at js/src/jit/IonAnalysis.cpp:3461
[...]
#126 0x0000555555eb204c in js::jit::ExtractLinearSum (ins=ins@entry=0x7ffff5f85738, space=space@entry=js::jit::MathSpace::Modulo) at js/src/jit/IonAnalysis.cpp:3461
#127 0x0000555555eb204c in js::jit::ExtractLinearSum (ins=0x7ffff5f85888, space=js::jit::MathSpace::Modulo) at js/src/jit/IonAnalysis.cpp:3461
rax	0x7ffff5f7b2a0	140737320039072
rbx	0x7ffff5f7b328	140737320039208
rcx	0x0	0
rdx	0x0	0
rsi	0x0	0
rdi	0x7ffff5f7b218	140737320038936
rbp	0x7ffff5f7b218	140737320038936
rsp	0x7ffff5b05000	140737315360768
r8	0x0	0
r9	0x7ffff4500000	140737292271616
r10	0x0	0
r11	0x0	0
r12	0x7ffff5cfe900	140737317431552
r13	0x4000	16384
r14	0x0	0
r15	0x7ffff5f7b2a0	140737320039072
rip	0x555555eb1fa2 <js::jit::ExtractLinearSum(js::jit::MDefinition*, js::jit::MathSpace)+2>
=> 0x555555eb1fa2 <js::jit::ExtractLinearSum(js::jit::MDefinition*, js::jit::MathSpace)+2>:	push   %r14
   0x555555eb1fa4 <js::jit::ExtractLinearSum(js::jit::MDefinition*, js::jit::MathSpace)+4>:	push   %r13
Attached file Testcase

Steve, can you triage this and find someone to look into this regression crash?

Flags: needinfo?(sdetar)

Iain, do you think you could look at this regression fuzz bug? I believe Ted already talk to you about it.

Flags: needinfo?(sdetar) → needinfo?(iireland)

Yeah, looking into this now.

Flags: needinfo?(iireland)

This testcase builds a WASM function whose body is "0 + 0 + 0 + 0 ... + 0", with 65536 0's:

var lfLogBuffer = `
(function testBigOffset() {
  var builder = new WasmModuleBuilder();
  let body = [kExprI32Const, 0, kExprI32Add];
  while (body.length <= 65536) body = body.concat(body);
  body.unshift(kExprI32Const, 0);
  body.push(kExprUnreachable);
  builder.addFunction('main', kSig_v_v).addBody(body).exportFunc("bar");
  try {
    builder.instantiate().exports.main();
  } catch (e) {}
})();
`;

During Ion compilation, we reach FoldLinearArithConstants, which tries to analyze this addition. It calls ExtractLinearSum, which tries to recurse over what is essentially a 65K element linked list. We run out of stack space and segfault.

This only happens with global value numbering disabled. With GVN enabled, we compile successfully. (Pre-GVN, the .gv output for iongraph is 10MB. Post-GVN, it is 482 bytes. I haven't been able to build PDFs to see exactly what changed, but I think the numbers speak for themselves.)

It would be nice to rewrite ExtractLinearSum to avoid the recursion, but I don't think it's a high priority, given that this only occurs on huge inputs, and only when we've tied one hand behind our back.

I think this would be a good first bug for somebody to fix.

Mentor: iireland
Severity: critical → minor
Keywords: good-first-bug
Priority: -- → P5

P5, setting as fix-optional for 67 so as to not remove it from our weekly triage meeting.

Whiteboard: [jsbugmon:update,bisect] → [jsbugmon:update]

It would be astonishing if that were the cause of this bug.

autobisectjs shows this is probably related to the following changeset:

The first bad revision is:
changeset: https://hg.mozilla.org/mozilla-central/rev/b2b72530f293
user: Brian Hackett
date: Wed Feb 22 06:45:05 2017 -0700
summary: Bug 1341326 - Set stack limit and stack size properly for helper threads, r=jandem.

Brian, is bug 1341326 a likely regressor?

Blocks: 1341326
Flags: needinfo?(bhackett1024)

(In reply to Gary Kwong [:gkw] [:nth10sd] - clearing bugmail from comment #9)

autobisectjs shows this is probably related to the following changeset:

The first bad revision is:
changeset: https://hg.mozilla.org/mozilla-central/rev/b2b72530f293
user: Brian Hackett
date: Wed Feb 22 06:45:05 2017 -0700
summary: Bug 1341326 - Set stack limit and stack size properly for helper threads, r=jandem.

Brian, is bug 1341326 a likely regressor?

No, that bug just sets the stack limit for helper threads.

Flags: needinfo?(bhackett1024)

Bulk change for all regression bugs with status-firefox67 as 'fix-optional' to be marked 'affected' for status-firefox68.

Whiteboard: [jsbugmon:update] → [jsbugmon:update,ignore]
JSBugMon: The testcase found in this bug no longer reproduces (tried revision 3d5cd10cb1b2).

Hi Steven, should we close this one?

Flags: needinfo?(sdetar)

Bulk change to wontfix for 68 (P3+ carryover with needinfo).

Decoder: Based on comment 12 above, can we close this bug?

Flags: needinfo?(sdetar) → needinfo?(choller)

No, this bug is still active. It is likely that the testcase in comment 0 broke due to some unrelated change that affected stack size. But the problem still exists.

Flags: needinfo?(choller)
Whiteboard: [jsbugmon:update,ignore] → [bugmon:confirmed]
BugMon: Bug remains reproducible on 884162af76f5
Flags: needinfo?(choller)
Bugmon Analysis:
Unable to reproduce bug on mozilla-central 20200312154222-e230c3be11df.
The bug appears to have been fixed in the following build range:
> Start: df3fda80bbeb71d4e1b10e8d205d740b2f959d9e (20190325155230)
> End: 3d5cd10cb1b20c1f83a189fbeb2e22f470bac0ec (20190325155340)
> Pushlog: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=df3fda80bbeb71d4e1b10e8d205d740b2f959d9e&tochange=3d5cd10cb1b20c1f83a189fbeb2e22f470bac0ec
Removing bugmon keyword as no further action possible.
Please review the bug and re-add the keyword for further analysis.
Keywords: bugmon
This is an automated crash issue comment:

Summary: Crash [@ js::jit::ExtractLinearSum(js::jit::MDefinition*, js::jit::MathSpace)]
Build version: mozilla-central revision 20200310-c7766d0b4a12
Build flags: (buildFlags not available)
Runtime options: --fuzzing-safe --cpu-count=2 --ion-offthread-compile=off --baseline-eager --ion-gvn=off

Testcase:

    See attachment.

Backtrace:

    received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 0x7ffff5dff700 (LWP 2335)]
    0x0000555555f17cec in js::jit::ExtractLinearSum(js::jit::MDefinition*, js::jit::MathSpace) ()
    #0  0x0000555555f17cec in js::jit::ExtractLinearSum(js::jit::MDefinition*, js::jit::MathSpace) ()
    #1  0x0000555555f17d97 in js::jit::ExtractLinearSum(js::jit::MDefinition*, js::jit::MathSpace) ()
    #2  0x0000555555f17d97 in js::jit::ExtractLinearSum(js::jit::MDefinition*, js::jit::MathSpace) ()
    #3  0x0000555555f17d97 in js::jit::ExtractLinearSum(js::jit::MDefinition*, js::jit::MathSpace) ()
    #4  0x0000555555f17d97 in js::jit::ExtractLinearSum(js::jit::MDefinition*, js::jit::MathSpace) ()
    #5  0x0000555555f17d97 in js::jit::ExtractLinearSum(js::jit::MDefinition*, js::jit::MathSpace) ()
    #6  0x0000555555f17d97 in js::jit::ExtractLinearSum(js::jit::MDefinition*, js::jit::MathSpace) ()
    #7  0x0000555555f17d97 in js::jit::ExtractLinearSum(js::jit::MDefinition*, js::jit::MathSpace) ()
    #8  0x0000555555f17d97 in js::jit::ExtractLinearSum(js::jit::MDefinition*, js::jit::MathSpace) ()
    [...]
    #127 0x0000555555f17d97 in js::jit::ExtractLinearSum(js::jit::MDefinition*, js::jit::MathSpace) ()
    rax	0x7ffff56e5248	140737311035976
    rbx	0x7ffff56e52c0	140737311036096
    rcx	0x180000	1572864
    rdx	0x17f040	1568832
    rsi	0x0	0
    rdi	0x7ffff56e51d0	140737311035856
    rbp	0x7ffff5c03020	140737316401184
    rsp	0x7ffff5c03000	140737316401152
    r8	0x6900	26880
    r9	0x0	0
    r10	0x0	0
    r11	0x200	512
    r12	0x7ffff56e51d0	140737311035856
    r13	0x7ffff56e5248	140737311035976
    r14	0x0	0
    r15	0x0	0
    rip	0x555555f17cec <js::jit::ExtractLinearSum(js::jit::MDefinition*, js::jit::MathSpace)+12>
    => 0x555555f17cec <_ZN2js3jit16ExtractLinearSumEPNS0_11MDefinitionENS0_9MathSpaceE+12>:	push   %rbx
       0x555555f17ced <_ZN2js3jit16ExtractLinearSumEPNS0_11MDefinitionENS0_9MathSpaceE+13>:	push   %rax

This bug is still active and causing me some trouble in conjunction with WebAssembly. Apparently it triggers easily when large WebAssembly bodies are combined with --ion-gvn=off.

Jan, should we stop fuzzing with --ion-gvn=off for now?

Flags: needinfo?(choller) → needinfo?(jdemooij)
Duplicate of this bug: 1648988
Crash Signature: [@ js::jit::ExtractLinearSum] → [@ js::jit::ExtractLinearSum] [@ ??]

(In reply to Christian Holler (:decoder) from comment #21)

This bug is still active and causing me some trouble in conjunction with WebAssembly. Apparently it triggers easily when large WebAssembly bodies are combined with --ion-gvn=off.

Jan, should we stop fuzzing with --ion-gvn=off for now?

I'd prefer not doing that - turning off GVN makes certain code paths much harder to reach and it also means code will start depending on GVN always running which isn't great.

I'll try to look at this soon. Maybe we could manually limit the recursion depth as a simple short-term fix for fuzzing.

Crash Signature: [@ js::jit::ExtractLinearSum] [@ ??] → [@ js::jit::ExtractLinearSum] [@ ??]

Hi; I'm trying to see if I can get my head around fixes to pick one up.

Are there unit tests for https://searchfox.org/mozilla-central/source/js/src/jit/IonAnalysis.cpp#3311 which could help me get an understanding of its current behavior?

Is SimpleLinearSum only looking if it can replace the MDefinition with a constant?

If SimpleLinearSum.constant == 0 does that mean it was not possible to extract the SimpleLinearSum, so the term will be evaluated?

iain, you're flagged as mentor on this one, cf comment 24.

Flags: needinfo?(iireland)

Hi Brent! Thanks for taking a look at this!

There are no unit tests for most of SM. It's hard to split the interesting parts JS execution up into tidy testable units. Once you have a fix, you can test it using the jit-test suite. Instructions for building and testing SM are here. (I would recommend testing an optimized debug build.)

Is SimpleLinearSum only looking if it can replace the MDefinition with a constant?

Close. ExtractLinearSum is trying to replace the MDefinition with the sum of an MDefinition and a constant. So if you had an MDefinition representing 7 + foo() + 5 - 3, ExtractLinearSum would simplify it to foo() + 9. SimpleLinearSum is the type we use to represent this result. It contains an arbitrary MDefinition (an MCall of foo in this example) and a constant (9 in this case).

If SimpleLinearSum.constant == 0 does that mean it was not possible to extract the SimpleLinearSum, so the term will be evaluated?

The term will always be evaluated unless the entire expression can be folded to a constant, in which case term will be nullptr.

The problem in this bug happens when we are calling ExtractLinearSum on an MDefinition that represents an extremely large sequence of 1 + 1 + 1 + .... Because ExtractLinearSum calls itself recursively, we end up overflowing the stack and crashing. Normally this kind of expression is cleaned up before we get to this point by another optimization called Global Value Numbering. However, turning off GVN can be useful while fuzzing SpiderMonkey. The goal of this bug is to avoid crashing here.

There are a few ways we could do this. The easiest might be to keep track of the depth of the recursion and just return ins + 0 when that limit is exceeded. (The most complete fix would be to change this from a recursive algorithm to an iterative one using a work queue, but that might be overkill for a problem that only occurs when we are tying one hand behind our back.)

Does that help? Let me know if you have any additional questions!

Flags: needinfo?(jdemooij)
Flags: needinfo?(iireland)
Flags: needinfo?(brentgracey)

I would like to work on this if nobody else is actively doing so.

keep track of the depth of the recursion and just return ins + 0 when that limit is exceeded.

I see that it crashes for a depth of 65536. I'm not sure what the expected available stack size is for this environment, so can you suggest an appropriate depth limit I should set?

Flags: needinfo?(iireland)

Hi! Thanks for working on this.

You are correct that the stack size varies from environment to environment. Instead of hardcoding a limit, we have checkRecursionLimit to test whether we are close to overflowing the stack. Because you will be handling the potential overflow immediately, the specific variant you probably want to use here is checkRecursionLimitDontReport, which won't report a stack overflow error to the JSContext.

Let me know if you need any more help!

Flags: needinfo?(iireland)
Flags: needinfo?(brentgracey)

Hi Iain, thanks for the help.

bool CheckRecursionLimitDontReport(JSContext* cx)

CheckRecursionLimitDontReport requires a pointer to a JSContext, but ExtractLinearSum does not take that as an input. How can we deduce the stack space and stack limits without a JSContext? Do we need to add that as a parameter to ExtractLinearSum and pass a JSContext pointer wherever we are calling it?

Flags: needinfo?(iireland)

Ha! An exceptionally good question. That'll teach me to reply without rereading ExtractLinearSum first.

Passing a JSContext* around is not really feasible here, unfortunately. Compilation can run offthread without access to a JSContext. My previous response was too clever. The real solution is probably to use a hardcoded limit, which means we have to come up with a number.

In practice, when GVN is not disabled, I would be surprised if the depth hit double digits on most code. I would try something like 100, then add a temporary MOZ_ASSERT and run the jit-tests to verify that we don't hit the limit normally. (Use mach jit-test --ion to trigger more aggressive stress-testing of the JITs.) If 100 hits the limit, bump it up to 500 or 1000. (The exact number shouldn't matter much. Real code should almost never hit this limit.)

Sorry about the previous bad advice. Good luck!

Flags: needinfo?(iireland)
Assignee: nobody → kanishk509
Status: NEW → ASSIGNED

Hey Iain,

I logged the recursion depths and found that jit-test --ion hits a max recursion depth of 33.
I've set the SAFE_RECURSION_LIMIT to 100. I've defined this constant inside the ExtractLinearSum function body, let me know if I should put it somewhere else.

Submitted the patch for review.

Flags: needinfo?(iireland)

Looks great! Stamped and landed.

Flags: needinfo?(iireland)
Pushed by iireland@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/77e417379092
Set maximum recursion limit for ExtractLinearSum. r=iain
Status: ASSIGNED → RESOLVED
Closed: 7 months ago
Resolution: --- → FIXED
Target Milestone: --- → 81 Branch
You need to log in before you can comment on or make changes to this bug.