Closed Bug 1118397 Opened 9 years ago Closed 9 years ago

26MB regression on AreWeSlimYet on 12/14

Categories

(Core :: XPCOM, defect)

Version: 32 Branch
Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED WORKSFORME

People

(Reporter: mccr8, Assigned: mccr8)

References

(Blocks 1 open bug)

Details

(Whiteboard: [MemShrink:P1][awsy+][awsy-])

Attachments

(1 file)

Eric pointed this out to me.  It seems to be due to bug 1052793.

It looks like both explicit and resident regressed for "After TP5, tabs closed [+30s]" and "After TP5, tabs closed [+30s, forced GC]".  The other numbers don't seem to have moved much, so I'm not sure what is going on.  The forced-GC numbers being higher is odd, as my patch should not have changed their behavior.
Does the forced GC end up running GC (and CC?) only once?
(In reply to Olli Pettay [:smaug] from comment #1)
> Does the forced GC end up running GC (and CC?) only once?

The AWSY code is here: https://github.com/mozilla/areweslimyet

I'll have to dig deeper to track down where we actually trigger the GC; my guess is we only do it once.
It looks like it runs GC+CC 50 times:
https://github.com/mozilla/areweslimyet/search?utf8=%E2%9C%93&q=doFullGC
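
Roughly, the doFullGC step amounts to a loop like the following. This is a C++-style model with hypothetical stand-ins (TriggerGC, TriggerCC, Settle); the real harness drives the browser's GC and CC from its own test JavaScript.

#include <iostream>

// Hypothetical stand-ins for the harness hooks; the real AWSY test invokes
// the browser's GC/CC entry points from its own JavaScript.
static void TriggerGC() { /* request a full garbage collection */ }
static void TriggerCC() { /* request a cycle collection */ }
static void Settle()    { /* give asynchronous cleanup a moment to finish */ }

int main() {
  // The doFullGC step repeats GC+CC many times (about 50 in the harness),
  // so that objects freed by one pass can be reclaimed by the next.
  const int kIterations = 50;
  for (int i = 0; i < kIterations; ++i) {
    TriggerGC();
    TriggerCC();
    Settle();
  }
  std::cout << "ran GC+CC " << kIterations << " times\n";
  return 0;
}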

It seems odd that that line is almost identical to the one where we just wait 30 seconds, though.
I think there's a bug with the GC functions that AWSY calls.  They don't prepare for a full GC, so according to Terrence this means they don't collect any zones.  Which would explain that problem...
(In reply to Andrew McCreight [:mccr8] from comment #4)
> I think there's a bug with the GC functions that AWSY calls.  They don't
> prepare for a full GC, so according to Terrence this means they don't
> collect any zones.  Which would explain that problem...

Oops, I am wrong.  These functions all correctly call PrepareForFullGC.
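
For context, here is a minimal self-contained model of why the prepare step matters (hypothetical types, not SpiderMonkey's actual API): the collector only visits zones that were scheduled beforehand, and preparing for a full GC simply schedules all of them.

#include <cstddef>
#include <iostream>
#include <vector>

// Minimal model of zone scheduling: a collection only visits zones that were
// scheduled beforehand, and "prepare for full GC" schedules every zone.
// Hypothetical types, not SpiderMonkey's real API.
struct Zone {
  bool scheduledForGC = false;
};

struct Runtime {
  std::vector<Zone> zones;

  void PrepareForFullGC() {
    for (Zone& z : zones) {
      z.scheduledForGC = true;
    }
  }

  // Returns how many zones were actually collected.
  std::size_t Collect() {
    std::size_t collected = 0;
    for (Zone& z : zones) {
      if (z.scheduledForGC) {
        ++collected;
        z.scheduledForGC = false;  // the schedule is consumed by this GC
      }
    }
    return collected;
  }
};

int main() {
  Runtime rt;
  rt.zones.resize(4);

  // Without the prepare step nothing is scheduled, so nothing is collected.
  std::cout << "no prepare:   " << rt.Collect() << " zones collected\n";

  // With the prepare step, every zone is collected.
  rt.PrepareForFullGC();
  std::cout << "with prepare: " << rt.Collect() << " zones collected\n";
  return 0;
}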
There's a 24MB increase in js-main-runtime/gc-heap/unused-arenas after my patch, and a 24MB decrease in decommitted/js-non-window/gc-heap/decommitted-arenas.  In other words, the same arenas that used to be decommitted are now left committed but unused, so it looks like we're not shrinking properly when idle.

Additionally, there is a 2.41MB increase in images/content/raster/unused, but I guess that could be noise.
Whiteboard: [MemShrink] → [MemShrink:P1]
Terrence, can you think of some reason my patch to make us collect all zones less often would make us fail to decommit arenas?  This test opens a bunch of sites in tabs, 5 at a time, closing and opening new tabs as it goes; then it closes all of the tabs and waits 30 seconds.  For some reason, at that point, we haven't decommitted the arenas, whereas before we did.  This persists even after running the GC a bunch of times.
Flags: needinfo?(terrence)
The code to do this is so insane that I honestly have no idea. The way it works is: at the end of GC we start the background helper thread to do sweeping. This starts with helperState.shrinkFlag == false, so it will not do the decommit. At some point, we get the "ShrinkBuffers" call from Gecko. This locks the GC and checks if the background thread is still running. If we've already finished background sweeping (presumably with shrinkFlag == false and no decommit), it kicks off a new sweep thread with shrinkFlag == true and presumably succeeds in decommitting.

If, on the other hand, the background thread is still running, things get weird fast. First, the foreground thread sets the non-volatile, non-atomic helperState.shrinkFlag to true. This, naturally, is hella racy. Whoever implemented this seems to have been aware of that, so goes to some trouble in the background thread to... I'm not sure -- certainly nothing that improves the situation, but it sure looks complicated! We read the shrink flag to the stack before testing it and resetting it, and we loop on the shared memory so that we'll retry if the flag gets set again between the read and the end of the loop. Maybe some of this helped in the era before multi-core? Of course the real race is with the thread state changing, which is not protected at all, so...
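
To make the shape of that handoff concrete, here is a stripped-down, self-contained sketch of the flow just described. All names are hypothetical, and the flag is atomic only so the sketch itself is well defined; the real code used a plain non-atomic field.

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

// Stripped-down model of the handoff: the foreground just sets a flag and
// hopes the background sweeper is still around to notice it.
static std::atomic<bool> shrinkFlag{false};

static void Decommit() { std::cout << "decommitting unused arenas\n"; }

static void BackgroundSweep() {
  do {
    // Copy the shared flag to the stack before testing it, then clear it:
    // the read-to-stack-then-reset pattern described above.
    bool shrink = shrinkFlag.exchange(false);

    // ... background sweeping work would happen here ...

    if (shrink) {
      Decommit();
    }
    // Retry if the flag was set again while we were sweeping; if it is set
    // only after this loop exits, the shrink request is simply lost.
  } while (shrinkFlag.load());
}

int main() {
  std::thread sweeper(BackgroundSweep);

  // Foreground ("ShrinkBuffers" from Gecko): set the flag some time later
  // and hope the background thread is still running to act on it.
  std::this_thread::sleep_for(std::chrono::milliseconds(1));
  shrinkFlag = true;

  sweeper.join();
  return 0;
}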

It seems that we really do want to more or less unconditionally decommit, but we're just failing to because of incompetence. I've got a patch to rip out this fiasco and just do the decommit about half written. Give me a few more hours of uninterrupted time and I think we can solve this problem at the source.
Assignee: continuation → terrence
Flags: needinfo?(terrence)
ni? myself to check that this fixes the spike on AWSY, plus submit a pull request to mark the regression and fix on AWSY.
Flags: needinfo?(continuation)
By "this" I mean the backout of bug 1052793.
Flags: needinfo?(continuation)
Flags: needinfo?(continuation)
> It seems that we really do want to more or less unconditionally decommit, but we're just
> failing to because of incompetence.

As far as I can tell, we do decommit unconditionally when the flag is set. shrinkFlag is always accessed with the GC lock held, so it's not racy. We set shrinkFlag on the main thread and then start the background sweeping thread if it's not already going. The background sweeper eventually checks the flag and shrinks if it's set. It clears the flag once it has done the shrink.
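
For contrast with the sketch above, here is a lock-protected model of the same handoff as just described (hypothetical names, not the real GC code): the flag is only ever touched with the lock held, and the sweeper clears it once the shrink has been done.

#include <iostream>
#include <mutex>
#include <thread>

// Self-contained model of the behaviour described in this comment: the shrink
// flag is only read or written with the GC lock held, the main thread sets
// it, and the sweeper clears it once the shrink has actually happened.
struct GCState {
  std::mutex lock;          // stand-in for the GC lock
  bool shrinkFlag = false;

  void RequestShrink() {    // main thread, "ShrinkBuffers" from Gecko
    std::lock_guard<std::mutex> guard(lock);
    shrinkFlag = true;
  }

  void BackgroundSweep() {  // background sweeper thread
    // ... sweeping work ...
    bool doShrink;
    {
      std::lock_guard<std::mutex> guard(lock);
      doShrink = shrinkFlag;
      shrinkFlag = false;   // cleared once the shrink has been handled
    }
    if (doShrink) {
      std::cout << "shrinking: decommitting unused arenas\n";
    }
  }
};

int main() {
  GCState gc;
  gc.RequestShrink();                                // flag set under the lock
  std::thread sweeper(&GCState::BackgroundSweep, &gc);
  sweeper.join();                                    // sweeper shrinks and clears it
  return 0;
}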
You are correct; the AutoUnlockGC is in the block above.  This is probably actually related to:

https://dxr.mozilla.org/mozilla-central/source/dom/base/nsJSEnvironment.cpp#2394
Yeah. It looks like I added that in bug 730853. I don't remember why. We should probably just change it.
So, thinking about what we actually need, I think we can get away with the following almost-trivial patch.

No try run because infra is down again. :-(
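
For anyone reading along without the attachment, here is a rough sketch of the idea (hypothetical types, not the actual diff): the final slice of a collection is treated as a shrinking GC, so unused chunks are decommitted right away rather than waiting for a later ShrinkBuffers call.

#include <iostream>

// Minimal model of "always shrink on the last slice": the invocation kind is
// upgraded to a shrinking GC when the final slice of a collection completes.
// Hypothetical types, not SpiderMonkey's real GC code.
enum class InvocationKind { Normal, Shrinking };

struct Collector {
  void RunSlice(bool isLastSlice, InvocationKind kind) {
    // ... marking/sweeping work for this slice ...
    if (isLastSlice) {
      // The patch's idea: when the collection finishes, behave as if a
      // shrinking GC had been requested, instead of waiting for a later
      // ShrinkBuffers call that may or may not arrive in time.
      kind = InvocationKind::Shrinking;
    }
    if (kind == InvocationKind::Shrinking && isLastSlice) {
      DecommitUnusedChunks();
    }
  }

  void DecommitUnusedChunks() {
    std::cout << "decommitting unused chunks at the end of the GC\n";
  }
};

int main() {
  Collector gc;
  gc.RunSlice(/* isLastSlice = */ false, InvocationKind::Normal);  // no shrink yet
  gc.RunSlice(/* isLastSlice = */ true, InvocationKind::Normal);   // shrinks here
  return 0;
}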
Attachment #8546206 - Flags: review?(jcoppeard)
Comment on attachment 8546206 [details] [diff] [review]
always_shrink_on_last_slice-v0.diff

Review of attachment 8546206 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good!
Attachment #8546206 - Flags: review?(jcoppeard) → review+
I suspect this will regress Splay and maybe other Octane benchmarks. This patch effectively removes the chunk cache, since we'll always be throwing away unused chunks at the end of every GC. Why not just fix the code in nsJSEnvironment?
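
To see why that costs benchmark throughput, here is a tiny self-contained model of the trade-off (hypothetical code, not the real chunk allocator): keeping a few empty chunks cached lets the next allocation burst reuse them, while decommitting everything at the end of each GC forces the next burst back to the OS.

#include <cstddef>
#include <iostream>

// Tiny model of the trade-off: a chunk cache keeps empty chunks after a GC so
// the next allocation burst can reuse them; throwing every unused chunk away
// at the end of each GC forces the next burst to map fresh memory.
struct ChunkPool {
  std::size_t cached = 0;        // empty chunks kept around between GCs
  std::size_t osAllocations = 0; // how often we had to ask the OS for memory

  void AllocateChunk() {
    if (cached > 0) {
      --cached;                  // cheap: reuse a cached chunk
    } else {
      ++osAllocations;           // expensive: map fresh memory from the OS
    }
  }

  void EndOfGC(std::size_t unusedChunks, bool alwaysShrink) {
    // With alwaysShrink, every unused chunk is decommitted instead of cached.
    cached = alwaysShrink ? 0 : unusedChunks;
  }
};

static std::size_t Run(bool alwaysShrink) {
  ChunkPool pool;
  for (int gcCycle = 0; gcCycle < 10; ++gcCycle) {
    for (int i = 0; i < 8; ++i) {
      pool.AllocateChunk();      // allocation burst (e.g. Splay)
    }
    pool.EndOfGC(/* unusedChunks = */ 8, alwaysShrink);
  }
  return pool.osAllocations;
}

int main() {
  std::cout << "with chunk cache: " << Run(false) << " OS allocations\n";
  std::cout << "always shrink:    " << Run(true) << " OS allocations\n";
  return 0;
}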
(In reply to Bill McCloskey (:billm) from comment #17)
> I suspect this will regress splay and maybe other octane benchmarks. This
> patch effectively removes the chunk cache since we'll always be throwing
> away unused chunks at the end of every GC.

Good point! We appear to consistently be 500-1000 points lower with this patch applied.

> Why not just fix the code in
> nsJSEnvironment?

We have more detailed information about our heap usage in SpiderMonkey, so we should be able to make better decisions about how we retain memory there. I still strongly believe that the right solution is to make us try harder to retain the right amount of memory, rather than playing games with shrink-buffers and a random delay.

I was hoping to get there incrementally, but this is all so entangled that I'm not sure it's possible. In any case, we've backed out now so we can close this bug. I'll open a new bug to track an actual fix.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
I'm going to leave this open until I can verify the regression is fixed on AWSY, but I can take it from here.
Assignee: terrence → continuation
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I confirmed the backout fixed the regression on AWSY.  I still need to update the marker for it, but this bug can be closed for now.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → WORKSFORME
I submitted a pull request for an annotation on AWSY.
Flags: needinfo?(continuation)
I added annotations to AWSY for this regression and fix.
Whiteboard: [MemShrink:P1] → [MemShrink:P1][awsy+][awsy-]
Blocks: AWSY