Perma [tier2] js\src\jit-test\tests\wasm\simd\spec\simd_f32x4_cmp.js | uncaught exception: out of memory (code 3, args "--ion-eager --ion-offthread-compile=off --more-compartments") [0.9 s]
Categories
(Core :: Javascript: WebAssembly, defect, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox79 | --- | fixed |
People
(Reporter: intermittent-bug-filer, Assigned: lth)
References
Details
(Keywords: intermittent-failure)
Attachments
(4 files)
Filed by: abutkovits [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=306254004&repo=mozilla-central
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/TRMliKxZQmG0AarGm6ZPyw/runs/0/artifacts/public/logs/live_backing.log
[task 2020-06-13T23:04:06.968Z] TEST-PASS | js\src\jit-test\tests\wasm\simd\spec\simd_f32x4_cmp.js | Success (code 0, args "") [0.9 s]
[task 2020-06-13T23:04:07.849Z] uncaught exception: out of memory
[task 2020-06-13T23:04:07.849Z]
[task 2020-06-13T23:04:07.849Z] Exit code: 3
[task 2020-06-13T23:04:07.849Z] FAIL - wasm\simd\spec\simd_f32x4_cmp.js
[task 2020-06-13T23:04:07.849Z] TEST-UNEXPECTED-FAIL | js\src\jit-test\tests\wasm\simd\spec\simd_f32x4_cmp.js | uncaught exception: out of memory (code 3, args "--ion-eager --ion-offthread-compile=off --more-compartments") [0.9 s]
[task 2020-06-13T23:04:07.849Z] INFO exit-status : 3
[task 2020-06-13T23:04:07.849Z] INFO timed-out : False
[task 2020-06-13T23:04:07.850Z] INFO stderr 2> uncaught exception: out of memory
[task 2020-06-13T23:04:07.901Z] TEST-PASS | js\src\jit-test\tests\wasm\simd\spec\simd_f32x4_cmp.js | Success (code 59, args "--ion-eager --ion-offthread-compile=off --ion-check-range-analysis --ion-extra-checks --no-sse3 --no-threads") [0.1 s]
[task 2020-06-13T23:04:08.767Z] uncaught exception: out of memory
[task 2020-06-13T23:04:08.767Z]
[task 2020-06-13T23:04:08.767Z] Exit code: 3
[task 2020-06-13T23:04:08.767Z] FAIL - wasm\simd\spec\simd_f32x4_cmp.js
[task 2020-06-13T23:04:08.767Z] TEST-UNEXPECTED-FAIL | js\src\jit-test\tests\wasm\simd\spec\simd_f32x4_cmp.js | uncaught exception: out of memory (code 3, args "--baseline-eager") [0.9 s]
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Updated•7 months ago
|
Assignee | ||
Comment 2•7 months ago
|
||
This looks like a memory-starved test system -- running tests locally with --tbpl I see no particularly interesting memory spike for that test case. It does run for a while and use a bit of memory, but nothing that would cause an OOM.
Comment 3•7 months ago
|
||
this seems to be a perma failure on mozilla-central win2012 debug. This is run on an aws ec2 c5.4xlarge machine which 16 vCpu and 32G RAM. These are running windows10x64, so it should take advantage of the RAM and cpus.
:lth, are you the person to look into this? what other info can I help get?
Assignee | ||
Comment 4•7 months ago
|
||
Yeah, I have a fix for this, awaiting approval.
Assignee | ||
Comment 5•7 months ago
|
||
Bumping and taking this so as to manage it when the fix lands.
Assignee | ||
Comment 6•7 months ago
|
||
Actually probable dup of bug 1644554 (I got confused about pending patches), will monitor orange factor for a day or so before dupping.
Updated•7 months ago
|
Assignee | ||
Comment 7•7 months ago
|
||
Hm, if this is still failing then it's less likely a dup, I'll attempt to repro locally on current m-c.
Assignee | ||
Comment 8•7 months ago
•
|
||
I can repro locally in a debug-noopt build. --ion-eager is required but the other arguments are not. --ion-eager will mean more machine code for JS. We have a limited machine code budget on x86 and the test program is huge and contains a very large number of modules; if GC is not sufficiently prompt (or if we have a leak) this may actually be an OOM.
On the other hand, the test case was never a problem on 32-bit Linux.
Assignee | ||
Comment 9•7 months ago
|
||
Indeed visual studio shows memory usage growing monotonically, hitting 200MB before we OOM out. ReportOutOfMemory is called from WasmModuleObject::construct() because CompileBuffer fails because finishCodeTier fails because ModuleSegment::create fails because ... ultimately, ProcessExecutableMemory::allocate fails because we've allocated too many pages of executable code, the static max being 2240. We then call my_LargeAllocFailCallback() in js.cpp, which runs a gc, but appears to fail to reclaim any of these bytes because the number of pages allocated remains the same. This is mysterious and worth investigating. Either we have a bad leak or the callback must force some eager sweeping or something.
Anyway, this is an actual OOM and we could avoid it simply by splitting up the test file, though I'd prefer to make the GC do what we want it to do.
Assignee | ||
Comment 10•7 months ago
|
||
Making the GC type GC_SHRINK does not help.
Indeed there are three mysteries:
- why does the last-ditch GC not help here?
- why is this last-ditch GC needed in the first place, when there are plainly very many dead modules and instances waiting to be reaped?
- why is this not a problem on Linux, as the number of available executable pages is the same?
Assignee | ||
Comment 11•7 months ago
|
||
It is a problem on 32-bit Linux, it's just that nobody's looked...
Assignee | ||
Comment 12•7 months ago
|
||
So the last-ditch GC deletes a lot of Module objects but apparently no Code blobs at all. This would seem mysterious except it frees no Instance objects at all, through the entire run. That's a curious leak. WasmInstanceObject::finalize is never called.
Assignee | ||
Comment 13•7 months ago
|
||
OK, so this may be an artifact of ion-eager or it may be some kind of artifact of how JS code is GC'd. The test case has this structure (about 45000 lines of it):
test = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`(module ... (func (export "run") ...) ...)`)), {'':ins.exports})
assertEq(test.exports.run(), 1)
(where ins
is just a shared instance of a module with testing functions, referenced from run
.)
If I comment out the assertEq lines I no longer OOM, and indeed I can add more wasm modules ad infinitum (I stopped at 130000 lines) and the instances thus created are being reaped by the GC as desired.
With the assertEq lines present, no instances are reaped at all. Thus the conclusion is that there's compiled JS code that holds onto the instances and that code is not GCd.
Assignee | ||
Comment 14•7 months ago
|
||
Jan/Jon, is there a way I can force the collection of dead compiled JS, see previous comment? Or are all the compiled JS blobs hanging off a script node somewhere and thus undead until the script goes away too?
Comment 15•7 months ago
•
|
||
If you run dumpHeap() at an appropriate point it might tell you what is keeping the code alive.
Comment 16•7 months ago
|
||
(In reply to Lars T Hansen [:lth] from comment #14)
Jan/Jon, is there a way I can force the collection of dead compiled JS, see previous comment? Or are all the compiled JS blobs hanging off a script node somewhere and thus undead until the script goes away too?
You could try a shrinking GC, gc(null, "shrinking")
.
Assignee | ||
Comment 17•7 months ago
|
||
(In reply to Jan de Mooij [:jandem] from comment #16)
You could try a shrinking GC,
gc(null, "shrinking")
.
Tried that (see also comment 10) but that does not help, so something live keeps things alive.
The output from dumpHeap is a little obscure to me still and of course --ion-eager makes things more confusing. More later.
Assignee | ||
Comment 18•7 months ago
|
||
Here's a stripped-down case. Suppose I run it like this:
dist/bin/js --ion-eager ~/moz/fnix.js | grep WebAssembly.Instance
then the output is this:
0xf5869358 B string <atom: length 20> WebAssembly.Instance
0xf5858100 B WebAssembly.Instance <no private>
0xf5858150 B WebAssembly.Instance <no private>
0xf58581a0 B WebAssembly.Instance <no private>
0xf58581f0 B WebAssembly.Instance <no private>
0xf5858240 B WebAssembly.Instance <no private>
0xf5858290 B WebAssembly.Instance <no private>
0xf58582e0 B WebAssembly.Instance <no private>
0xf5858330 B WebAssembly.Instance <no private>
0xf5858380 B WebAssembly.Instance <no private>
0xf58583d0 B WebAssembly.Instance <no private>
0xf5858420 B WebAssembly.Instance <no private>
0xf5858470 B WebAssembly.Instance <no private>
0xf58584c0 B WebAssembly.Instance <no private>
0xf5858510 B WebAssembly.Instance <no private>
0xf5858560 B WebAssembly.Instance <no private>
0xf58585b0 B WebAssembly.Instance <no private>
where those are instances that are held live somehow. If instead I run it like this:
dist/bin/js --ion-eager -i -- < ~/moz/fnix.js | grep WebAssembly.Instance
then the output is this:
js> var ins = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> var run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
js> run = new WebAssembly.Instance(new WebAssembly.Module(wasmTextToBinary(`
0xf5869330 B string <atom: length 20> WebAssembly.Instance
0xf58580b0 B WebAssembly.Instance <no private>
0xf58581f0 B WebAssembly.Instance <no private>
where the last two lines show the remaining two live instances (referenced from ins
and run
), as we would expect.
That is, compiling the test case as a single script instead of as a sequence of scripts (individual statements) will result in the dead instances all being retained.
Assignee | ||
Comment 19•7 months ago
|
||
Same test case, but each run = ... ; assertEq(...)
pair is wrapped in a (function () { ... })();
. Now values are collected as they should be. So something about the top-level script being compiled as a unit definitely hangs on to the intermediate values.
Assignee | ||
Comment 20•7 months ago
|
||
And in fnix.js, it's the act of invoking run
that creates this value that is retained, because if I replace
assertEq(run.exports.run(), 1)
with
void(run.exports.run(), 1)
the problem remains, but if I go all the way to
void(run.exports.run, 1)
it disappears.
Assignee | ||
Comment 21•7 months ago
|
||
It is also not the 'ins' value -- the instance that's imported into every other instance -- that keeps the other instances alive, because the problem persists after I replace the import with a local function and remove the ins value.
I've been staring at various levels of intermediate code and machine code and nothing is an obvious problem. There is a mysterious bit of SetPropertyCache with the identifier that is or starts with "MMMM" generated every creation of of an instance that leaks, but since it's the act of invoking the function that seems to trigger the leak and since this cache entry is overwritten for every such instance that's created, that's not probably it. The size of the Ion stack frame appears to be invariant wrt to the number of instances created, too.
The best guess is still that the act of invoking an exported method on an instance somehow leaks that instance to the JSScript or JitScript or something else in its environment; that is consistent with all evidence so far.
Assignee | ||
Comment 22•7 months ago
|
||
Simpler test case (less noise). grep the output for 'WebAssembly.Instance'; there should be no output that shows object-ish values.
Assignee | ||
Comment 23•7 months ago
|
||
The conclusion here is that the JitScript holds onto IC caches for each of the calls to run
, and each IC cache holds onto the individual run
function, which holds onto its instance. This is really a perfect storm: A very large number of distinct call sites each holding onto a full wasm instance in a very long script without any opportunity for the GC to kick in and discard the JitScript even in principle.
I'm going to split the test case into two files, and also split the corresponding f64x2 case, which fails in the same way on my system.
Assignee | ||
Comment 24•7 months ago
|
||
The combination of the following factors causes the test cases to OOM
on 32-bit systems due to a lack of object code space:
- --ion-eager
- very many distinct call sites with one IC cache per call site
- one distinct callee per call site that holds onto a distinct wasm
module - each wasm module claiming at least one wasm page's worth of code
- IC caches that hold onto the individual functions that are called
Just split the test cases to work around this rare problem.
Comment 25•7 months ago
|
||
Pushed by lhansen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/da5a73bd070b split very large test cases. r=rhunt
Comment 26•7 months ago
|
||
bugherder |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Updated•7 months ago
|
Description
•