Closed Bug 849526 Opened 7 years ago Closed 7 years ago

IonMonkey: Modified splay benchmark keeps eating memory, OOMs

Categories

(Core :: JavaScript Engine, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla23

People

(Reporter: jandem, Assigned: jandem)

References

Details

Attachments

(2 files)

Attached file Testcase
Attached is a standalone version of Octane-splay, modified to run a fixed number of iterations.

With --no-ion, memory usage stays at 324 MB and the script finishes without problems.

With --no-jm, memory usage quickly grows to 1.7 GB, then we OOM:

js(61955,0xac3832c0) malloc: *** mmap(size=16777216) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug

This is with a 32-bit build on OS X, inbound revision d073b1e6ba69.

I think it's related to IonCode somehow, we have the same problem on the baseline compiler branch with --no-ion. For instance, shouldn't IonLinker::newCode call updateMallocCounter?

This is probably responsible for the large spikes on Octane on AWFY...
Most of the memory allocation in splay should be for GC things, so the malloc counter shouldn't have much effect.

What should be happening is that allocation should cause the gcIsNeeded flag to be set on the compartment and it should also trigger the operation callback. Then we should GC inside the operation callback.

Maybe the operation callback isn't getting run from Ion/baseline?
I debugged this a little, and we are GCing. The leaked memory is part of the GC heap. It seems to be kept alive by the conservative stack scanner. If I comment out MarkConservativeStackRoots, then memory usage is stable.

We have code to skip portions of the stack that are controlled by Ion. I wonder if it's working.
Blocks: 848152
(In reply to Bill McCloskey (:billm) from comment #2)
> I debugged this a little, and we are GCing. The leaked memory is part of the
> GC heap. It seems to be kept alive by the conservative stack scanner. If I
> comment out MarkConservativeStackRoots, then memory usage is stable.

Nice find, thanks! I tried a 64-bit build and memory is stable with Ion, so the conservative stack scanner causing this makes sense. And indeed, Octane is a lot more stable on 64-bit on AWFY..
OK, I had some fun with the conservative stack scanner to track this down. There's a single "root" keeping most of the GC heap alive.

From SplayTree.prototype.findMax:

  var current = opt_startNode || this.root_;

This JSOP_OR initially runs in the interpreter:

BEGIN_CASE(JSOP_OR)
{
    bool cond = ToBoolean(regs.sp[-1]);
    if (cond == true) {
...

Clang inlines ToBoolean and spills regs.sp[-1] (opt_startNode) to the stack. Then we enter Ion or baseline and the spilled Value is still on the stack. The conservative stack scanner marks it, keeping a huge part of the tree alive.

To confirm this analysis, I added the following line to SplayTree.prototype.findMax, and we start leaking memory on 64-bit too:

    if (typeof foo === "undefined") foo = opt_startNode;

I'm not sure how to fix this. We could return out of js::Interpret before entering baseline, but that may be complicated. Since we are making good progress with exact rooting, I'm inclined to find a workaround to keep Clang from spilling the Value in the meantime. Any thoughts/ideas?
Wow, great job in tracking that down.  Seriously obscure.
Is it possible to do something like:

    BEGIN_CASE(JSOP_OR)
    {
        Value val = regs.sp[-1]; // [1]
        bool cond = ToBoolean(val); // [2]
        val = Value(); // [3]
        if (cond == true) {
            ...

The idea here is to force the compiler to put the Value onto the stack, use the value from the stack then clear it immediately after use.

I know that this is ugly, but it should fix the Clang issue. It will need a comment associated with it (ideally pointing at the relevant Clang bug).

NOTE: There are likely to be other places similar to this where Clang is doing the same inline + spill a Value behaviour -- esp. if implicit conversions to Value objects are performed.

Clearly, this is a Clang optimisation bug as an implicitly created object as a result of a parameter call should be destroyed after the call, irrespective of whether the call is inlined or not.
(In reply to Reece H. Dunn from comment #6)
> Clearly, this is a Clang optimisation bug as an implicitly created object as
> a result of a parameter call should be destroyed after the call,
> irrespective of whether the call is inlined or not.

It's undefined when the memory/register/whatever associated with something goes unused again.  When the destructor's called likely *is* defined -- but Value deliberately has no destructor.  What clang's doing here is not a bug per spec.

Given we're off in undefined/implementation-defined land, I don't think there's an obvious fix.  I agree with hacking it til it works in practice.
Is this responsible for the spikes on Octane on AWFY? If so, we should probably just land a hack like the one in comment 6.

It's a bit weird that the spikes in CodeLoad, Gameboy, PdfJS and Splay have a (varying) negative correlation to those in Box2D, though. Could Box2D in any way gain by us collecting less garbage?
Attached patch WorkaroundSplinter Review
This patch wins at least 1000 points on Octane-splay and makes Octane a lot more stable for me.

It's hard to say for sure whether this fixes the problem 100% of the time though, but I'd like to take this workaround (until we set fire to the conservative scanner) and see how it does on AWFY.
Assignee: general → jdemooij
Status: NEW → ASSIGNED
Attachment #741877 - Flags: review?(dvander)
Comment on attachment 741877 [details] [diff] [review]
Workaround

Review of attachment 741877 [details] [diff] [review]:
-----------------------------------------------------------------

Hah. Ok.
Attachment #741877 - Flags: review?(dvander) → review+
https://hg.mozilla.org/mozilla-central/rev/a98823877629
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla23
Duplicate of this bug: 848152
You need to log in before you can comment on or make changes to this bug.