Closed Bug 831588 Opened 11 years ago Closed 11 years ago

Don't measure decommitted GC arenas in the "explicit" tree

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla22
Tracking Status
firefox21 --- wontfix
firefox22 --- fixed
b2g18 --- fixed
b2g18-v1.0.0 --- wontfix
b2g18-v1.0.1 --- wontfix

People

(Reporter: n.nethercote, Assigned: n.nethercote)

References

Details

(Whiteboard: [MemShrink])

Attachments

(1 file)

Patch coming shortly.
This patch moves the "gc-heap/decommitted-arenas" reports out of the "explicit"
tree.  jlebar's been wanting this for a while, and it's needed for bug 819769.

Here's what the sub-tree in the "Other measurements" section looks like.

> 2,027,520 B (100.0%) -- decommitted
> ├──1,404,928 B (69.29%) -- workers
> │  ├──1,404,928 B (69.29%) ── workers()/worker(resource://gre/modules/osfile/osfile_async_worker.js, 0x7f533e31c800)/gc-heap/decommitted-arenas
> │  └──────────0 B (00.00%) ── workers(mozilla.org)/worker(backgroundSpaceInvaders_worker.js, 0x7f5340b31000)/gc-heap/decommitted-arenas
> └────622,592 B (30.71%) ── js-non-window/gc-heap/decommitted-arenas

These entries were all previously in the "explicit" tree.

The downside of this patch is that it slightly slows down the "explicit"
measurement obtained by telemetry.  Instead of just getting the entire GC heap
size in a single lookup, we also have to iterate over each chunk and then
possibly each arena (256 per chunk) to subtract the decommitted arenas.
Determining if an arena is decommitted requires a bitfield access, so if we had
a 1 GiB JS heap, we'd need (in the worst case) 256*1024 bitfield accesses.
(That's the worst case because we can do faster in the case where no arenas in
a chunk are decommitted.)

Also note that this will affect AWSY's explicit graph, as we typically have
about 13--14 MiB of decommitted arenas at the end (for all of the lines,
AFAICT).
Attachment #703109 - Flags: review?(terrence)
Attachment #703109 - Flags: review?(justin.lebar+bug)
Comment on attachment 703109 [details] [diff] [review]
Don't measure decommitted GC arenas in the "explicit" tree.

Review of attachment 703109 [details] [diff] [review]:
-----------------------------------------------------------------

Makes sense.
Attachment #703109 - Flags: review?(terrence) → review+
> The downside of this patch is that it slightly slows down the "explicit"
> measurement obtained by telemetry.

I was just bitten by slow telemetry in bug 789975, so now I'm a bit afraid of this.

We should make sure we have telemetry for how long this takes.  How long does it take on your machine to collect reports before/after this change?

If we wanted to keep telemetry the same, we could place the "decommitted" fields under the explicit tree and subtract them out somewhere in other-measurements.  That would be unfortunate from a visual point of view, since "explicit" is privileged as the topmost number in about:memory, but it'd be a start.

Alternatively, we could change GetExplicitNonHeap into GetExplicitAndDecommittedNonHeap, and then telemetry could report explicit + decommitted.

But I guess that's all premature optimization until we have an idea of how much cost we're talking about here.  I'm afraid it might be a non-trivial amount of memory traffic.
Comment on attachment 703109 [details] [diff] [review]
Don't measure decommitted GC arenas in the "explicit" tree.

>+    bool allClear() const {
>+        for (size_t i = 0; i < numSlots; i++)
>+            if (map[i])
>+                return false;
>+        return true;

FWIW we could use SIMD on x86 to make this much faster.  Not sure about ARM.

>diff --git a/js/src/jsmemorymetrics.cpp b/js/src/jsmemorymetrics.cpp
>--- a/js/src/jsmemorymetrics.cpp
>+++ b/js/src/jsmemorymetrics.cpp

> static void
>+DecommittedArenasChunkCallback(JSRuntime *rt, void *data, gc::Chunk *chunk)
>+{
>+    // This case is common and fast to check.  Do it first.
>+    if (chunk->decommittedArenas.allClear())
>+        return;
>+
>+    size_t n = 0;
>+    for (size_t i = 0; i < gc::ArenasPerChunk; i++)
>+        if (chunk->decommittedArenas.get(i))
>+            n += gc::ArenaSize;

Hmm...I don't see how the allClear() check makes this any faster.

If no bit is set in the array, we iterate over the whole decommittedArenas
array and take one branch per element.

If at least one bit is set, we iterate over part of the decommittedArenas array
(taking one branch per element), then iterate over the whole array and again
take one branch per element.

So doesn't the strategy of iterating once over the whole array (without
checking allClear() first) dominate?

On both ARM and x86, I'm pretty sure we could speed this up using SIMD, if we
needed to.  (We just need to sum up the whole bit array.)

r=me, but let's make sure we have the right telemetry so we have a chance of noticing if this gets out of hand.
Attachment #703109 - Flags: review?(justin.lebar+bug) → review+
|map| is a bitmask, so |allClear()| is actually checking 8 arenas at once. It should also be on the same page as the |Chunk| footer, so this check /should/ be pretty fast in practice. If we are worried about the perf of |allClear()|, rather than breaking out SIMD I'd cast |map| to a |const uintptr_t*| and do the check per-word rather than per-byte.
> |map| is a bitmask, so |allClear()| is actually checking 8 arenas at once.

Ah, I see.  But map is actually a uintptr_t[], so isn't it checking a whole word at once, as written?
> map is actually a uintptr_t[], so isn't it checking a whole
> word at once, as written?

Yes.  There are 256 arenas per chunk, so on 32-bit platforms there are 8 comparisons in a successful allClear() call, and on 64-bit platforms there are 4 comparisons.

x86 actually has bit counting operations that are perfect for this kind of thing, but they're pretty fiddly to implement in a cross-platform fashion.  (Nanojit had them.)  And it would only be notably faster in the not-all-clear case.

> r=me, but let's make sure we have the right telemetry so we have a chance of
> noticing if this gets out of hand.

What would that telemetry look like?
> What would that telemetry look like?

Oh, I missed comment 3 above.

One big difference between this and bug 789975 is that there are no system calls involved here.

But I'll do some measurements.
(In reply to Justin Lebar [:jlebar] from comment #6)
> > |map| is a bitmask, so |allClear()| is actually checking 8 arenas at once.
> 
> Ah, I see.  But map is actually a uintptr_t[], so isn't it checking a whole
> word at once, as written?

Right you are! I should have checked before posting.
Maybe we should just land this.  I was probably being way too pessimistic assuming this could have a perf impact.

This would make my life a lot easier on b2g; subtracting out the decommitted memory is a pain.
I'll wait until zones (bug 759585) has landed -- this will conflict a bit, and that bug is more important so I don't want this bug to be an obstacle.
Depends on: Zones
https://hg.mozilla.org/mozilla-central/rev/0aafa85948db
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla22
Whiteboard: [MemShrink]
Comment on attachment 703109 [details] [diff] [review]
Don't measure decommitted GC arenas in the "explicit" tree.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): n/a
User impact if declined: It's much more difficult to understand memory reports without this patch.
Testing completed: On m-c for a while.
Risk to taking this patch (and alternatives if risky): This should only affect memory reports.
String or UUID changes made by this patch: none

I'd like this for b2g18, but it's not necessary for v1.0.1.
Attachment #703109 - Flags: approval-mozilla-b2g18?
Attachment #703109 - Flags: approval-mozilla-b2g18? → approval-mozilla-b2g18+
You need to log in before you can comment on or make changes to this bug.