Closed Bug 1645638 Opened 7 months ago Closed 7 months ago

Perma [tier2] js\src\jit-test\tests\wasm\simd\spec\simd_f32x4_cmp.js | uncaught exception: out of memory (code 3, args "--ion-eager --ion-offthread-compile=off --more-compartments") [0.9 s]

Categories

(Core :: Javascript: WebAssembly, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla79
Tracking Status
firefox79 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: lth)

References

Details

(Keywords: intermittent-failure)

Attachments

(4 files)

Filed by: abutkovits [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=306254004&repo=mozilla-central
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/TRMliKxZQmG0AarGm6ZPyw/runs/0/artifacts/public/logs/live_backing.log


[task 2020-06-13T23:04:06.968Z] TEST-PASS | js\src\jit-test\tests\wasm\simd\spec\simd_f32x4_cmp.js | Success (code 0, args "") [0.9 s]
[task 2020-06-13T23:04:07.849Z] uncaught exception: out of memory
[task 2020-06-13T23:04:07.849Z]
[task 2020-06-13T23:04:07.849Z] Exit code: 3
[task 2020-06-13T23:04:07.849Z] FAIL - wasm\simd\spec\simd_f32x4_cmp.js
[task 2020-06-13T23:04:07.849Z] TEST-UNEXPECTED-FAIL | js\src\jit-test\tests\wasm\simd\spec\simd_f32x4_cmp.js | uncaught exception: out of memory (code 3, args "--ion-eager --ion-offthread-compile=off --more-compartments") [0.9 s]
[task 2020-06-13T23:04:07.849Z] INFO exit-status : 3
[task 2020-06-13T23:04:07.849Z] INFO timed-out : False
[task 2020-06-13T23:04:07.850Z] INFO stderr 2> uncaught exception: out of memory
[task 2020-06-13T23:04:07.901Z] TEST-PASS | js\src\jit-test\tests\wasm\simd\spec\simd_f32x4_cmp.js | Success (code 59, args "--ion-eager --ion-offthread-compile=off --ion-check-range-analysis --ion-extra-checks --no-sse3 --no-threads") [0.1 s]
[task 2020-06-13T23:04:08.767Z] uncaught exception: out of memory
[task 2020-06-13T23:04:08.767Z]
[task 2020-06-13T23:04:08.767Z] Exit code: 3
[task 2020-06-13T23:04:08.767Z] FAIL - wasm\simd\spec\simd_f32x4_cmp.js
[task 2020-06-13T23:04:08.767Z] TEST-UNEXPECTED-FAIL | js\src\jit-test\tests\wasm\simd\spec\simd_f32x4_cmp.js | uncaught exception: out of memory (code 3, args "--baseline-eager") [0.9 s]

Component: JavaScript Engine → Javascript: WebAssembly

This looks like a memory-starved test system -- running tests locally with --tbpl I see no particularly interesting memory spike for that test case. It does run for a while and use a bit of memory, but nothing that would cause an OOM.

This seems to be a perma failure on mozilla-central win2012 debug. The job runs on an AWS EC2 c5.4xlarge machine, which has 16 vCPUs and 32 GB of RAM. These are running Windows 10 x64, so it should be able to take advantage of the RAM and CPUs.

:lth, are you the person to look into this? What other info can I help get?

Flags: needinfo?(lhansen)

Yeah, I have a fix for this, awaiting approval.

Flags: needinfo?(lhansen)

Bumping and taking this so as to manage it when the fix lands.

Assignee: nobody → lhansen
Severity: normal → S2
Status: NEW → ASSIGNED
Priority: P5 → P1

Actually a probable dup of bug 1644554 (I got confused about pending patches); I'll monitor Orange Factor for a day or so before duping.

See Also: → 1644554
Summary: Intermittent js\src\jit-test\tests\wasm\simd\spec\simd_f32x4_cmp.js | uncaught exception: out of memory (code 3, args "--ion-eager --ion-offthread-compile=off --more-compartments") [0.9 s] → Perma [tier2] js\src\jit-test\tests\wasm\simd\spec\simd_f32x4_cmp.js | uncaught exception: out of memory (code 3, args "--ion-eager --ion-offthread-compile=off --more-compartments") [0.9 s]

Hm, if this is still failing then it's less likely a dup; I'll attempt to repro locally on current m-c.

I can repro locally in a debug-noopt build. --ion-eager is required but the other arguments are not. --ion-eager means more machine code is generated for JS. We have a limited machine code budget on x86, and the test program is huge and contains a very large number of modules; if the GC is not sufficiently prompt (or if we have a leak), this may actually be a genuine OOM.

On the other hand, the test case was never a problem on 32-bit Linux.

Indeed, Visual Studio shows memory usage growing monotonically, hitting 200MB before we OOM out. ReportOutOfMemory is called from WasmModuleObject::construct() because CompileBuffer fails, because finishCodeTier fails, because ModuleSegment::create fails, because ... ultimately, ProcessExecutableMemory::allocate fails because we've allocated too many pages of executable code (the static max being 2240). We then call my_LargeAllocFailCallback() in js.cpp, which runs a GC but appears to fail to reclaim any of these bytes, because the number of pages allocated remains the same. This is mysterious and worth investigating: either we have a bad leak, or the callback must force some eager sweeping or something.

Anyway, this is an actual OOM and we could avoid it simply by splitting up the test file, though I'd prefer to make the GC do what we want it to do.

Making the GC type GC_SHRINK does not help.

Indeed there are three mysteries:

  • why does the last-ditch GC not help here?
  • why is this last-ditch GC needed in the first place, when there are plainly very many dead modules and instances waiting to be reaped?
  • why is this not a problem on Linux, as the number of available executable pages is the same?

It is a problem on 32-bit Linux; it's just that nobody's looked...

So the last-ditch GC deletes a lot of Module objects but apparently no Code blobs at all. This would seem mysterious, except that it also frees no Instance objects at all, through the entire run. That's a curious leak: WasmInstanceObject::finalize is never called.

OK, so this may be an artifact of --ion-eager, or it may be some kind of artifact of how JS code is GC'd. The test case has this structure (about 45000 lines of it):

test = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`(module ... (func (export "run") ...) ...)`)), {'':ins.exports})
assertEq(test.exports.run(), 1)

(where ins is just a shared instance of a module with testing functions, referenced from run.)
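As a self-contained sketch of that shape (the module text below is a hypothetical stand-in: the real file repeats this pattern thousands of times with much larger module bodies and imports the shared ins instance):

// Hypothetical stand-in for one generated test block; the real test
// passes {'':ins.exports} as imports and uses much larger modules.
test = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
  (module
    (func (export "run") (result i32)
      (i32.const 1)))`)))
assertEq(test.exports.run(), 1)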

If I comment out the assertEq lines I no longer OOM, and indeed I can add more wasm modules ad infinitum (I stopped at 130000 lines) and the instances thus created are being reaped by the GC as desired.

With the assertEq lines present, no instances are reaped at all. The conclusion is that there is compiled JS code that holds onto the instances, and that that code is itself not GC'd.

Jan/Jon, is there a way I can force the collection of dead compiled JS, see previous comment? Or are all the compiled JS blobs hanging off a script node somewhere and thus undead until the script goes away too?

Flags: needinfo?(jdemooij)
Flags: needinfo?(jcoppeard)

If you run dumpHeap() at an appropriate point it might tell you what is keeping the code alive.
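For reference, a minimal sketch of that approach in the shell (test.js is a placeholder name; dumpHeap writes its node listing to stdout, which is what the grep invocations below rely on):

// After the suspect allocations: force a GC, then dump what survives.
gc();
dumpHeap();

// Then filter the output, e.g.:
//   dist/bin/js --ion-eager test.js | grep WebAssembly.Instance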

(In reply to Lars T Hansen [:lth] from comment #14)

Jan/Jon, is there a way I can force the collection of dead compiled JS, see previous comment? Or are all the compiled JS blobs hanging off a script node somewhere and thus undead until the script goes away too?

You could try a shrinking GC, gc(null, "shrinking").

Flags: needinfo?(jdemooij)

(In reply to Jan de Mooij [:jandem] from comment #16)

You could try a shrinking GC, gc(null, "shrinking").

Tried that (see also comment 10), but it does not help, so something live is keeping things alive.

The output from dumpHeap is still a little obscure to me, and of course --ion-eager makes things more confusing. More later.

Attached file fnix.js

Here's a stripped-down case. Suppose I run it like this:

dist/bin/js --ion-eager ~/moz/fnix.js | grep WebAssembly.Instance

then the output is this:

0xf5869358 B string <atom: length 20> WebAssembly.Instance
0xf5858100 B WebAssembly.Instance <no private>
0xf5858150 B WebAssembly.Instance <no private>
0xf58581a0 B WebAssembly.Instance <no private>
0xf58581f0 B WebAssembly.Instance <no private>
0xf5858240 B WebAssembly.Instance <no private>
0xf5858290 B WebAssembly.Instance <no private>
0xf58582e0 B WebAssembly.Instance <no private>
0xf5858330 B WebAssembly.Instance <no private>
0xf5858380 B WebAssembly.Instance <no private>
0xf58583d0 B WebAssembly.Instance <no private>
0xf5858420 B WebAssembly.Instance <no private>
0xf5858470 B WebAssembly.Instance <no private>
0xf58584c0 B WebAssembly.Instance <no private>
0xf5858510 B WebAssembly.Instance <no private>
0xf5858560 B WebAssembly.Instance <no private>
0xf58585b0 B WebAssembly.Instance <no private>

where those are instances that are held live somehow. If instead I run it like this:

dist/bin/js --ion-eager -i -- < ~/moz/fnix.js | grep WebAssembly.Instance

then the output is this:

js> var ins = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> var run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
0xf5869330 B string <atom: length 20> WebAssembly.Instance
0xf58580b0 B WebAssembly.Instance <no private>
0xf58581f0 B WebAssembly.Instance <no private>

where the last two lines show the remaining two live instances (referenced from ins and run), as we would expect.

That is, compiling the test case as a single script instead of as a sequence of scripts (individual statements) will result in the dead instances all being retained.

Attached file funfnix.js

Same test case, but each run = ... ; assertEq(...) pair is wrapped in a (function () { ... })();. Now values are collected as they should be. So something about the top-level script being compiled as a unit definitely hangs on to the intermediate values.
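Roughly this shape (the module text is a stand-in, as before):

// Each instantiate-and-call pair is wrapped in an IIFE, so the instance
// becomes unreachable as soon as the function returns.
(function () {
  var run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
    (module
      (func (export "run") (result i32)
        (i32.const 1)))`)))
  assertEq(run.exports.run(), 1)
})();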

And in fnix.js, it's the act of invoking run that causes the value to be retained, because if I replace

assertEq(run.exports.run(), 1)

with

void(run.exports.run(), 1)

the problem remains, but if I go all the way to

void(run.exports.run, 1)

it disappears.

It is also not the 'ins' value -- the instance that's imported into every other instance -- that keeps the other instances alive, because the problem persists after I replace the import with a local function and remove the ins value.

I've been staring at various levels of intermediate code and machine code, and nothing is an obvious problem. There is a mysterious bit of SetPropertyCache with an identifier that is or starts with "MMMM", generated on every creation of an instance that leaks, but since it's the act of invoking the function that seems to trigger the leak, and since this cache entry is overwritten for every such instance that's created, that's probably not it. The size of the Ion stack frame appears to be invariant with respect to the number of instances created, too.

The best guess is still that the act of invoking an exported method on an instance somehow leaks that instance to the JSScript or JitScript or something else in its environment; that is consistent with all evidence so far.

Attached file smallfnix.js

Simpler test case (less noise). Grep the output for 'WebAssembly.Instance'; there should be no output that shows object-ish values.

The conclusion here is that the JitScript holds onto IC caches for each of the calls to run, and each IC cache holds onto the individual run function, which holds onto its instance. This is really a perfect storm: a very large number of distinct call sites, each holding onto a full wasm instance, in a very long script, with no opportunity for the GC to kick in and discard the JitScript even in principle.

I'm going to split the test case into two files, and also split the corresponding f64x2 case, which fails in the same way on my system.

The combination of the following factors causes the test cases to OOM
on 32-bit systems due to a lack of object code space:

  • --ion-eager
  • very many distinct call sites with one IC cache per call site
  • one distinct callee per call site that holds onto a distinct wasm
    module
  • each wasm module claiming at least one wasm page's worth of code
  • IC caches that hold onto the individual functions that are called

Just split the test cases to work around this rare problem.

See Also: → 1646321
Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/da5a73bd070b
split very large test cases. r=rhunt
Status: ASSIGNED → RESOLVED
Closed: 7 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla79
Flags: needinfo?(jcoppeard)