On my browser, sweep times are longer than Bill said is expected. Sometimes they are longer than mark times.
Some are bad like this:
GC(T+938.9) Type:Glob, Total:233.3, Wait:1.3, Mark:84.0, Sweep:147.0, FinObj:2.1, FinStr:0.4, FinScr:0.7, FinShp:2.3, DisCod:2.4, DisAnl:134.2, XPCnct:1.3, Destry:0.0, End:2.3, +Chu:0, -Chu:0, Reason: API
GC(T+1161.0) Type:Glob, Total:273.6, Wait:10.0, Mark:87.9, Sweep:139.9, FinObj:4.1, FinStr:0.6, FinScr:2.8, FinShp:30.2, DisCod:2.3, DisAnl:94.0, XPCnct:1.7, Destry:0.3, End:37.3, +Chu:0, -Chu:0, Reason: API
Some are not as bad:
GC(T+808.4) Type:Glob, Total:192.5, Wait:0.1, Mark:86.4, Sweep:57.8, FinObj:2.2, FinStr:0.4, FinScr:0.7, FinShp:2.3, DisCod:1.9, DisAnl:45.2, XPCnct:1.3, Destry:0.0, End:49.4, +Chu:0, -Chu:3, Reason: API
GC(T+1165.1) Type:Glob, Total:133.0, Wait:0.1, Mark:90.0, Sweep:42.6, FinObj:0.9, FinStr:0.1, FinScr:0.7, FinShp:2.3, DisCod:1.8, DisAnl:31.7, XPCnct:1.2, Destry:0.1, End:1.3, +Chu:0, -Chu:0, Reason: API
GC(T+1190.0) Type:Glob, Total:115.8, Wait:0.1, Mark:86.0, Sweep:29.0, FinObj:2.0, FinStr:0.3, FinScr:0.6, FinShp:2.2, DisCod:1.7, DisAnl:17.2, XPCnct:1.4, Destry:0.0, End:1.8, +Chu:0, -Chu:9, Reason: API
I don't know if this is relevant or not, but the sweep times seemed worse with TI disabled.
This is with gmail, twitter, zimbra, facebook, bugzilla, and a few other misc sites, with a little light browsing, like CNN.
about:telemetry 0.6 true
Add-on Compatibility Reporter 0.9.2 true
Flashblock 126.96.36.199 true
pdf.js PDF Reader 0.1 true
Places Maintenance 1.0 true
Test Pilot 1.2 true
Mozilla Labs: Prospector - AwesomeBar HD 14 false
What is DisAnl?
That's added in bug 699298. It times part of JSCompartment::sweep (the part that discards TI analysis data).
I guess it would be nice to allow long names in the error console and short names in the MOZ_GCTIMER column headings.
(In reply to Bill McCloskey (:billm) from comment #2)
> That's added in bug 699298. It times part of JSCompartment::sweep (the part
> that discards TI analysis data).
Oh ok the one I should review. Will do it tomorrow.
> I guess it would be nice to allow long names in the error console and short
> names in the MOZ_GCTIMER column headings.
That's the place where we point people with long GC pauses right? I guess it would save us some time if we make it more verbose.
What's so expensive in the DisAnl part? Do we touch every reachable object?
The DisAnl part is sweeping type information in scripts and type objects. The times here are way higher than I saw back in July, and the DisAnl times when TI is disabled should be about zero, so something weird seems to be going on.
My home computer (Win7) doesn't seem to be showing this problem. I don't have as much stuff open, but it is synced with my office computer.
We tracked this down to the LifoAlloc destructor in JSCompartment::sweep. TI allocates memory from the LifoAlloc in 4K chunks. A bunch of them get freed in the destructor.
We increased the block size from 4K to 1MB and the sweep times went back to normal. So I think we need to find a better strategy for sizing the blocks. Maybe some kind of doubling or something.
Nick, Chris said that he had talked about this with you in the LifoAlloc bugs. Could you guys figure out a solution? We need something that gets rid of these >100ms pauses but doesn't bloat us. And if we do use a little more memory, we need to make sure that the slop is reported in about:memory.
I introduced LifoAlloc::sizeOf() in bug 684800 which accounts for unused space and space used by headers, which LifoAlloc::used() doesn't, and I converted the typeLifoAlloc to use sizeOf(). (I probably should have removed used(), too.) So the about:memory side is covered.
The simplest thing would be to try a range of different sizes for TYPE_LIFO_ALLOC_PRIMARY_CHUNK_SIZE. 4KB is pretty tiny given that "analysis-temporary" is very often multiple MB. 1MB sounds a little high... maybe something like 128KB or 256KB reduces the sweep times acceptably?
Is anybody looking into this? I'm currently seeing sweep times that average 250ms on my browser (vs 97ms for marking).
mccr8, since you are seeing the problem manifest, can you try changing TYPE_LIFO_ALLOC_PRIMARY_CHUNK_SIZE to 128KB and see if that helps?
Why don't we do some kind of exponential growth here? We could use a small growth factor to ensure we don't waste too much space. It would be much more general than picking a particular size and crossing our fingers.
256KB is works fine, with mark times averaging 90ms and sweep times of 26ms. I thought you said 256 instead of 128, Nick, for some reason. I'll try 128KB now...
At 128KB, I have 175ms mark times and 48ms sweep times. Longer mark times are probably just different usage.
Is 26ms reasonable for sweep times? If so, we could just up to 256KB and land. If not, we could think about the exponential growth thing, though that would require some changes to LifoAlloc.
Created attachment 574938 [details] [diff] [review]
Maybe it's better just to fix this before we forget about it.
tree-management reports a 1.5% TraceMalloc MaxHeap on WinNT5.2 increase from this patch, is this expected effect of enlarging the arenas?
No, this was not expected. I backed out the change.
It kind of makes sense that this would cause a regression. Each compartment has one of these allocators, so in the worst case we're wasting ~256K per compartment. That's horrible. I think we're going to have to go with an exponential growth scheme.
Do you have time to do this, Chris? We really do need it. Gavin was seeing GC times of over a second caused by this bug.
I was about to say the following when I collided with Bill's comment:
"It's quite plausible. But the allocations are fairly short-lived -- this structure is purged on every GC -- so I'm not too worried about it. The improvements in sweep times are definite, any slowdowns from slightly more memory being used are much murkier in comparison."
As long as we are allocating larger Lifo chunks, be they all the same size or exponentially increasing, it's possible that the last chunk we allocate will not get filled up.
And an exponential approach is probably worse. With the 256KB chunks we waste at most 256KB; with an exponential scheme... 4 + 8 + 16 + 32 + 64 + 128 + 256 == 508, which means we'll potentially waste more if the size of the Lifo allocations exceed 508KB, which is really common -- they're commonly multiple MB.
I'd recommend re-landing the existing patch. The cost of a 256KB, mostly-untouched, short-lived allocation isn't that high. Bill, if you're still uncomfortable, maybe you could do more measurements with 128KB or 64KB chunks, see if they still avoid the long sweep times enough to satisfy you.
I'm not sure we (or I) understand the problem enough to re-land the patch. My fear is that there are a bunch of compartments with very little TI data, where nearly all of the 256K is wasted. This might not be happening--maybe if we use any space, we use a lot. But it seems foolish to waste memory if we don't have to.
Also, we don't know if the memory is short-lived. Some of the data in the arena is preserved across GCs. If a tiny bit is preserved in each GC, then we're potentially wasting a lot of memory permanently.
Finally, an exponential growth scheme doesn't have to use doubling. It just needs to ensure that the next block allocated is some fraction of the total size currently allocated. So that allows us to control the amount of wasted memory as a fraction. Doubling uses 1.0, but we could use 0.25 or something like that.
I don't mean to be argumentative here. I just want to make sure we understand the workload before proceeding.
I'm tempted to say re-land, and see what difference it makes in other cases. If you want to go with exponential, you don't have to let it grow beyond 256KB - it can grow (at whatever rate) up to a cap.
I was thinking that the typeLifoAlloc was global, but it's per-compartment, so the potential for waste is higher than I thought.
(In reply to Bill McCloskey (:billm) from comment #21)
> Doubling uses 1.0, but we could use 0.25 or something
> like that.
The good thing about doubling is that you avoid slop bytes due to awkward request sizes. (We had one case in the JM assembler where the growth factor was 1.5x and we were wasting heaps of memory, so I changed it to 2x.)
I did some measurements with and without the patch. I used a pretty simple workload with 8 tabs (nytimes, gmail, etc.). I let them load and then sit for a while, doing a few GCs, before I measured. So this is steady-state behavior.
Without the patch, we waste about 31KB and use 2763 blocks. With the patch we waste about 3MB and use 55 blocks. If I reduce the block size to 128K, then we waste about 1.5MB and use 98 blocks.
I played around with a bunch of different allocation schemes, but none of them seem to do much better than 128KB blocks. I had been assuming that most of the space was wasted by very small arenas, but it's actually the compartments with lots of type data that waste the most.
So if we re-land, I think we definitely should use 128K blocks. However, is 1.5MB waste for 8 tabs acceptable?
One thing I noticed is that the amount of type data per compartment seems to be pretty steady. Could we add some kind of trim command that we could call after a GC? It would realloc the last block to be whatever size it needs to be.
> Without the patch, we waste about 31KB and use 2763 blocks. With the patch
> we waste about 3MB and use 55 blocks. If I reduce the block size to 128K,
> then we waste about 1.5MB and use 98 blocks.
With 256KB blocks:
- total = 55 * 256KB = 14080KB
- waste = 3072KB
- useful = 14080KB - 3072KB = 11008KB --> 78.2%
With 128KB blocks:
- total = 98 * 128KB = 12544KB
- waste = 1536KB
- useful = 12544KB - 1536KB = 11008KB --> 87.8%
With 4KB blocks:
- total mem = 2763 * 4KB = 11052KB
- waste = 31KB
- useful = 11052KB - 31KB = 11021KB --> 99.7%
> I played around with a bunch of different allocation schemes, but none of
> them seem to do much better than 128KB blocks. I had been assuming that most
> of the space was wasted by very small arenas, but it's actually the
> compartments with lots of type data that waste the most.
Each typeLifoAlloc will have only one block that is not full, right? So the average waste will should be N * B / 2, where N is the number of compartments and B is the block size. Do you know roughly what N was for your test? Hmm, I should be able to figure it out: if W = N * B / 2, then N = W / B * 2...
With 256KB blocks: N = 3072KB / 256KB * 2 = 24.0
With 128KB blocks: N = 1536KB / 128KB * 2 = 24.0
With 4KB blocks: N = 31KB / 4KB * 2 = 15.5
Seems reasonable for 8 tabs. Not sure about that 15.5, but this assumes an average waste of B/2 per compartment so it's approximate anyway.
> So if we re-land, I think we definitely should use 128K blocks. However, is
> 1.5MB waste for 8 tabs acceptable?
So that's average waste of 64KB per compartment. That's the kind of question that's really hard to answer, but everyone looks at me when asking it :/ Given that you have definite slow-downs without the patch, I'd say yes. However, this is the kind of thing that will kill us with compartment-per-global.
OK, I just relanded this for 128K. I guess if it starts to be a problem, the waste will show up in about:memory and we can fix it then.
Comment on attachment 574938 [details] [diff] [review]
This is a fairly serious regression in GC pause time for Mac users. It started happening in Firefox 10, and it was fixed on 11, so I think we should try to get this in beta.
It's very unlikely that this patch will make us crash or have any other correctness problems. However, it will cause a small increase in memory usage. This is a trade-off to avoid the long pauses (which may be hundreds of milliseconds).
Comment on attachment 574938 [details] [diff] [review]
Given our recent CC/GC investigations and how early we are in the cycle, let's take this on beta.
Is this something QA can verify without access to Andrew's profile?
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #32)
> Is this something QA can verify without access to Andrew's profile?
Has anyone noticed an after effect of this patch being lower overall memory use?