Last Comment Bug 699668 - long garbage collector sweep times due to LifoAlloc destructor in JSCompartment::sweep
: long garbage collector sweep times due to LifoAlloc destructor in JSCompartme...
Status: RESOLVED FIXED
[qa-]
: perf
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: Trunk
: x86 Mac OS X
: -- normal with 1 vote (vote)
: mozilla11
Assigned To: Bill McCloskey (:billm)
:
Mentors:
Depends on:
Blocks: 684039
  Show dependency treegraph
 
Reported: 2011-11-03 17:39 PDT by Andrew McCreight [:mccr8]
Modified: 2012-02-01 14:00 PST (History)
14 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
fixed


Attachments
simple fix (1.05 KB, patch)
2011-11-16 10:32 PST, Bill McCloskey (:billm)
n.nethercote: review+
akeybl: approval‑mozilla‑beta+
Details | Diff | Splinter Review

Description Andrew McCreight [:mccr8] 2011-11-03 17:39:49 PDT
On my browser, sweep times are longer than Bill said is expected.  Sometimes they are longer than mark times.

Some are bad like this:

GC(T+938.9) Type:Glob, Total:233.3, Wait:1.3, Mark:84.0, Sweep:147.0, FinObj:2.1, FinStr:0.4, FinScr:0.7, FinShp:2.3, DisCod:2.4, DisAnl:134.2, XPCnct:1.3, Destry:0.0, End:2.3, +Chu:0, -Chu:0, Reason:  API

GC(T+1161.0) Type:Glob, Total:273.6, Wait:10.0, Mark:87.9, Sweep:139.9, FinObj:4.1, FinStr:0.6, FinScr:2.8, FinShp:30.2, DisCod:2.3, DisAnl:94.0, XPCnct:1.7, Destry:0.3, End:37.3, +Chu:0, -Chu:0, Reason:  API

Some are not as bad:

GC(T+808.4) Type:Glob, Total:192.5, Wait:0.1, Mark:86.4, Sweep:57.8, FinObj:2.2, FinStr:0.4, FinScr:0.7, FinShp:2.3, DisCod:1.9, DisAnl:45.2, XPCnct:1.3, Destry:0.0, End:49.4, +Chu:0, -Chu:3, Reason:  API

GC(T+1165.1) Type:Glob, Total:133.0, Wait:0.1, Mark:90.0, Sweep:42.6, FinObj:0.9, FinStr:0.1, FinScr:0.7, FinShp:2.3, DisCod:1.8, DisAnl:31.7, XPCnct:1.2, Destry:0.1, End:1.3, +Chu:0, -Chu:0, Reason:  API

GC(T+1190.0) Type:Glob, Total:115.8, Wait:0.1, Mark:86.0, Sweep:29.0, FinObj:2.0, FinStr:0.3, FinScr:0.6, FinShp:2.2, DisCod:1.7, DisAnl:17.2, XPCnct:1.4, Destry:0.0, End:1.8, +Chu:0, -Chu:9, Reason:  API

I don't know if this is relevant or not, but the sweep times seemed worse with TI disabled.

This is with gmail, twitter, zimbra, facebook, bugzilla, and a few other misc sites, with a little light browsing, like CNN.

Addons:
about:telemetry 0.6 true
Add-on Compatibility Reporter 0.9.2 true
Flashblock 1.5.15.1 true
pdf.js PDF Reader 0.1 true 
Places Maintenance 1.0 true
Test Pilot 1.2 true
Mozilla Labs: Prospector - AwesomeBar HD 14 false
Comment 1 Gregor Wagner [:gwagner] 2011-11-03 17:46:33 PDT
What is DisAnl?
Comment 2 Bill McCloskey (:billm) 2011-11-03 17:48:54 PDT
That's added in bug 699298. It times part of JSCompartment::sweep (the part that discards TI analysis data).

I guess it would be nice to allow long names in the error console and short names in the MOZ_GCTIMER column headings.
Comment 3 Gregor Wagner [:gwagner] 2011-11-03 17:56:53 PDT
(In reply to Bill McCloskey (:billm) from comment #2)
> That's added in bug 699298. It times part of JSCompartment::sweep (the part
> that discards TI analysis data).
>

Oh ok the one I should review. Will do it tomorrow.
 
> I guess it would be nice to allow long names in the error console and short
> names in the MOZ_GCTIMER column headings.

That's the place where we point people with long GC pauses right? I guess it would save us some time if we make it more verbose.

What's so expensive in the DisAnl part? Do we touch every reachable object?
Comment 4 Brian Hackett (:bhackett) 2011-11-03 19:12:32 PDT
The DisAnl part is sweeping type information in scripts and type objects.  The times here are way higher than I saw back in July, and the DisAnl times when TI is disabled should be about zero, so something weird seems to be going on.
Comment 5 Andrew McCreight [:mccr8] 2011-11-03 19:15:35 PDT
My home computer (Win7) doesn't seem to be showing this problem.  I don't have as much stuff open, but it is synced with my office computer.
Comment 6 Bill McCloskey (:billm) 2011-11-04 15:18:21 PDT
We tracked this down to the LifoAlloc destructor in JSCompartment::sweep. TI allocates memory from the LifoAlloc in 4K chunks. A bunch of them get freed in the destructor.

We increased the block size from 4K to 1MB and the sweep times went back to normal. So I think we need to find a better strategy for sizing the blocks. Maybe some kind of doubling or something.

Nick, Chris said that he had talked about this with you in the LifoAlloc bugs. Could you guys figure out a solution? We need something that gets rid of these >100ms pauses but doesn't bloat us. And if we do use a little more memory, we need to make sure that the slop is reported in about:memory.
Comment 7 Nicholas Nethercote [:njn] 2011-11-05 12:38:53 PDT
I introduced LifoAlloc::sizeOf() in bug 684800 which accounts for unused space and space used by headers, which LifoAlloc::used() doesn't, and I converted the typeLifoAlloc to use sizeOf().  (I probably should have removed used(), too.)  So the about:memory side is covered.

The simplest thing would be to try a range of different sizes for TYPE_LIFO_ALLOC_PRIMARY_CHUNK_SIZE.  4KB is pretty tiny given that "analysis-temporary" is very often multiple MB.  1MB sounds a little high... maybe something like 128KB or 256KB reduces the sweep times acceptably?
Comment 8 Andrew McCreight [:mccr8] 2011-11-08 11:23:28 PST
Is anybody looking into this?  I'm currently seeing sweep times that average 250ms on my browser (vs 97ms for marking).
Comment 9 Nicholas Nethercote [:njn] 2011-11-08 11:54:11 PST
mccr8, since you are seeing the problem manifest, can you try changing TYPE_LIFO_ALLOC_PRIMARY_CHUNK_SIZE to 128KB and see if that helps?
Comment 10 Andrew McCreight [:mccr8] 2011-11-08 11:55:39 PST
Will do.
Comment 11 Bill McCloskey (:billm) 2011-11-08 11:58:02 PST
Why don't we do some kind of exponential growth here? We could use a small growth factor to ensure we don't waste too much space. It would be much more general than picking a particular size and crossing our fingers.
Comment 12 Andrew McCreight [:mccr8] 2011-11-08 13:08:17 PST
256KB is works fine, with mark times averaging 90ms and sweep times of 26ms.  I  thought you said 256 instead of 128, Nick, for some reason.  I'll try 128KB now...
Comment 13 Andrew McCreight [:mccr8] 2011-11-08 14:36:13 PST
At 128KB, I have 175ms mark times and 48ms sweep times.  Longer mark times are probably just different usage.
Comment 14 Nicholas Nethercote [:njn] 2011-11-08 16:50:23 PST
Is 26ms reasonable for sweep times?  If so, we could just up to 256KB and land.  If not, we could think about the exponential growth thing, though that would require some changes to LifoAlloc.
Comment 15 Bill McCloskey (:billm) 2011-11-16 10:32:42 PST
Created attachment 574938 [details] [diff] [review]
simple fix

Maybe it's better just to fix this before we forget about it.
Comment 17 Marco Bonardo [::mak] 2011-11-17 03:04:21 PST
https://hg.mozilla.org/mozilla-central/rev/aa05d607c95e
Comment 18 Marco Bonardo [::mak] 2011-11-17 05:11:48 PST
tree-management reports a 1.5% TraceMalloc MaxHeap on WinNT5.2 increase from this patch, is this expected effect of enlarging the arenas?
Comment 19 Bill McCloskey (:billm) 2011-11-17 11:38:29 PST
No, this was not expected. I backed out the change.

https://hg.mozilla.org/integration/mozilla-inbound/rev/dd9749c483ce

It kind of makes sense that this would cause a regression. Each compartment has one of these allocators, so in the worst case we're wasting ~256K per compartment. That's horrible. I think we're going to have to go with an exponential growth scheme.

Do you have time to do this, Chris? We really do need it. Gavin was seeing GC times of over a second caused by this bug.
Comment 20 Nicholas Nethercote [:njn] 2011-11-17 11:48:16 PST
I was about to say the following when I collided with Bill's comment:

"It's quite plausible.  But the allocations are fairly short-lived -- this structure is purged on every GC -- so I'm not too worried about it.  The improvements in sweep times are definite, any slowdowns from slightly more memory being used are much murkier in comparison."

As long as we are allocating larger Lifo chunks, be they all the same size or exponentially increasing, it's possible that the last chunk we allocate will not get filled up.

And an exponential approach is probably worse.  With the 256KB chunks we waste at most 256KB;  with an exponential scheme... 4 + 8 + 16 + 32 + 64 + 128 + 256 == 508, which means we'll potentially waste more if the size of the Lifo allocations exceed 508KB, which is really common -- they're commonly multiple MB.

I'd recommend re-landing the existing patch.  The cost of a 256KB, mostly-untouched, short-lived allocation isn't that high.  Bill, if you're still uncomfortable, maybe you could do more measurements with 128KB or 64KB chunks, see if they still avoid the long sweep times enough to satisfy you.
Comment 21 Bill McCloskey (:billm) 2011-11-17 12:07:02 PST
I'm not sure we (or I) understand the problem enough to re-land the patch. My fear is that there are a bunch of compartments with very little TI data, where nearly all of the 256K is wasted. This might not be happening--maybe if we use any space, we use a lot. But it seems foolish to waste memory if we don't have to.

Also, we don't know if the memory is short-lived. Some of the data in the arena is preserved across GCs. If a tiny bit is preserved in each GC, then we're potentially wasting a lot of memory permanently.

Finally, an exponential growth scheme doesn't have to use doubling. It just needs to ensure that the next block allocated is some fraction of the total size currently allocated. So that allows us to control the amount of wasted memory as a fraction. Doubling uses 1.0, but we could use 0.25 or something like that.

I don't mean to be argumentative here. I just want to make sure we understand the workload before proceeding.
Comment 22 Randell Jesup [:jesup] 2011-11-17 12:10:20 PST
I'm tempted to say re-land, and see what difference it makes in other cases.  If you want to go with exponential, you don't have to let it grow beyond 256KB - it can grow (at whatever rate) up to a cap.
Comment 23 Nicholas Nethercote [:njn] 2011-11-17 13:56:07 PST
I was thinking that the typeLifoAlloc was global, but it's per-compartment, so the potential for waste is higher than I thought.
Comment 24 Nicholas Nethercote [:njn] 2011-11-17 13:58:03 PST
(In reply to Bill McCloskey (:billm) from comment #21)
> Doubling uses 1.0, but we could use 0.25 or something
> like that.

The good thing about doubling is that you avoid slop bytes due to awkward request sizes.  (We had one case in the JM assembler where the growth factor was 1.5x and we were wasting heaps of memory, so I changed it to 2x.)
Comment 25 Bill McCloskey (:billm) 2011-11-17 14:05:50 PST
I did some measurements with and without the patch. I used a pretty simple workload with 8 tabs (nytimes, gmail, etc.). I let them load and then sit for a while, doing a few GCs, before I measured. So this is steady-state behavior.

Without the patch, we waste about 31KB and use 2763 blocks. With the patch we waste about 3MB and use 55 blocks. If I reduce the block size to 128K, then we waste about 1.5MB and use 98 blocks.

I played around with a bunch of different allocation schemes, but none of them seem to do much better than 128KB blocks. I had been assuming that most of the space was wasted by very small arenas, but it's actually the compartments with lots of type data that waste the most.

So if we re-land, I think we definitely should use 128K blocks. However, is 1.5MB waste for 8 tabs acceptable?

One thing I noticed is that the amount of type data per compartment seems to be pretty steady. Could we add some kind of trim command that we could call after a GC? It would realloc the last block to be whatever size it needs to be.
Comment 26 Nicholas Nethercote [:njn] 2011-11-17 15:11:30 PST
> Without the patch, we waste about 31KB and use 2763 blocks. With the patch
> we waste about 3MB and use 55 blocks. If I reduce the block size to 128K,
> then we waste about 1.5MB and use 98 blocks.

With 256KB blocks:
- total = 55 * 256KB = 14080KB
- waste = 3072KB
- useful = 14080KB - 3072KB = 11008KB --> 78.2%

With 128KB blocks:
- total = 98 * 128KB = 12544KB
- waste = 1536KB
- useful = 12544KB - 1536KB = 11008KB --> 87.8%

With 4KB blocks:
- total mem = 2763 * 4KB = 11052KB
- waste = 31KB
- useful = 11052KB - 31KB = 11021KB --> 99.7%


> I played around with a bunch of different allocation schemes, but none of
> them seem to do much better than 128KB blocks. I had been assuming that most
> of the space was wasted by very small arenas, but it's actually the
> compartments with lots of type data that waste the most.

Each typeLifoAlloc will have only one block that is not full, right?  So the average waste will should be N * B / 2, where N is the number of compartments and B is the block size.  Do you know roughly what N was for your test?  Hmm, I should be able to figure it out:  if W = N * B / 2, then N = W / B * 2...

With 256KB blocks:  N = 3072KB / 256KB * 2 = 24.0
With 128KB blocks:  N = 1536KB / 128KB * 2 = 24.0
With   4KB blocks:  N =   31KB /   4KB * 2 = 15.5

Seems reasonable for 8 tabs.  Not sure about that 15.5, but this assumes an average waste of B/2 per compartment so it's approximate anyway.


> So if we re-land, I think we definitely should use 128K blocks. However, is
> 1.5MB waste for 8 tabs acceptable?

So that's average waste of 64KB per compartment.  That's the kind of question that's really hard to answer, but everyone looks at me when asking it :/  Given that you have definite slow-downs without the patch, I'd say yes.  However, this is the kind of thing that will kill us with compartment-per-global.
Comment 27 Bill McCloskey (:billm) 2011-11-18 15:03:32 PST
OK, I just relanded this for 128K. I guess if it starts to be a problem, the waste will show up in about:memory and we can fix it then.

https://hg.mozilla.org/integration/mozilla-inbound/rev/487d3d5dc47f
Comment 28 Ed Morley [:emorley] 2011-11-19 05:11:34 PST
https://hg.mozilla.org/mozilla-central/rev/487d3d5dc47f
Comment 29 Bill McCloskey (:billm) 2011-12-21 16:31:32 PST
Comment on attachment 574938 [details] [diff] [review]
simple fix

This is a fairly serious regression in GC pause time for Mac users. It started happening in Firefox 10, and it was fixed on 11, so I think we should try to get this in beta.

It's very unlikely that this patch will make us crash or have any other correctness problems. However, it will cause a small increase in memory usage. This is a trade-off to avoid the long pauses (which may be hundreds of milliseconds).
Comment 30 Alex Keybl [:akeybl] 2011-12-26 10:45:26 PST
Comment on attachment 574938 [details] [diff] [review]
simple fix

[Triage Comment]
Given our recent CC/GC investigations and how early we are in the cycle, let's take this on beta.
Comment 31 Bill McCloskey (:billm) 2011-12-26 15:49:14 PST
https://hg.mozilla.org/releases/mozilla-beta/rev/3003c5a4c3a3
Comment 32 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2011-12-28 14:25:35 PST
Is this something QA can verify without access to Andrew's profile?
Comment 33 Bill McCloskey (:billm) 2011-12-28 14:26:53 PST
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #32)
> Is this something QA can verify without access to Andrew's profile?

No.
Comment 34 Danial Horton 2012-01-02 05:24:28 PST
Has anyone noticed an after effect of this patch being lower overall memory use?

Note You need to log in before you can comment on or make changes to this bug.