Provide Minor GC performance info to the profiler

RESOLVED FIXED in Firefox 63

Status

()

RESOLVED FIXED
9 months ago
7 months ago

People

(Reporter: pbone, Assigned: pbone)

Tracking

(Blocks: 2 bugs)

Trunk
mozilla63
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox63 fixed)

Details

Attachments

(5 attachments, 4 obsolete attachments)

(Assignee)

Description

9 months ago
Provide the following information to the gecko profiler and allow it to calculate some rates based on these such as bytes tenured per second.

 + Number of objects tenured
 + Size of objects tenured (already provided)
 + Time for tenuring phases (already provided)
 + Number of objects allocated since last minor GC
   + Nursery
   + Pretenured
(Assignee)

Updated

9 months ago
Assignee: nobody → pbone
Blocks: 1471438, 1467697
Status: NEW → ASSIGNED
(Assignee)

Comment 1

9 months ago
Attachment #8991793 - Flags: review?(jcoppeard)
Comment on attachment 8991793 [details] [diff] [review]
Bug 1473213 - Track the number of tenured cells

Review of attachment 8991793 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/Nursery.h
@@ +440,4 @@
>          size_t nurseryLazyCapacity = 0;
>          size_t nurseryUsedBytes = 0;
>          size_t tenuredBytes = 0;
> +        unsigned tenuredCells = 0;

Generally 'size_t' is used for counts like this.
Attachment #8991793 - Flags: review?(jcoppeard) → review+
Attachment #8993587 - Flags: review?(sphink) → review+
(Assignee)

Comment 4

8 months ago
Attachment #8994093 - Flags: review?(sphink)
Comment on attachment 8994093 [details] [diff] [review]
Bug 1473213 - Add cells allocated statistics

Review of attachment 8994093 [details] [diff] [review]:
-----------------------------------------------------------------

The biggest worry with this patch is that it adds code to the nursery allocation fast path. I would at least want some performance numbers that argue that it doesn't regress performance substantially. Though I'd also want to get jandem's gut feel as to whether this is acceptable.

::: js/src/gc/Nursery.cpp
@@ +618,5 @@
>          json.property("chunk_alloc_us", timeInChunkAlloc_, json.MICROSECONDS);
> +    json.property("cells_allocated_nursery",
> +        runtime()->gc.stats().allocsSinceMinorGCNursery());
> +    json.property("cells_allocated_tenured",
> +        runtime()->gc.stats().allocsSinceMinorGCTenured());

This stuff doesn't fit in 99 columns?

::: js/src/gc/Statistics.cpp
@@ +1043,5 @@
> +
> +void
> +Statistics::countAllocsSinceMinorGCNursery() {
> +    allocsSinceMinorGCNursery_++;
> +}

Hm. I probably would have called this Statistics::noteNurseryAllocation or similar. The "count" verb seems a bit deceptive, and also I'd rather let the stats object decide what to do with an allocation event.

Also, this should absolutely be inlined in the header file. We really really do not want a function call for this.

@@ +1047,5 @@
> +}
> +void
> +Statistics::countAllocsSinceMinorGCTenured() {
> +    allocsSinceMinorGCTenured_++;
> +}

noteTenuredAllocation?

::: js/src/gc/Statistics.h
@@ +324,5 @@
> +     * These events cannot be kept in the above array, we need to take their
> +     * address.
> +     */
> +    uint32_t allocsSinceMinorGCNursery_;
> +    uint32_t allocsSinceMinorGCTenured_;

I thought we did tenured allocations on other threads, eg the parse thread, which means this would be a data race. Perhaps we're ok with the race, if we're going to handwave that it doesn't really matter if we get the value wrong, but that should at the very least be commented carefully. And I suspect it would be better to have a per-thread counter, or skip the increments on other threads.
Attachment #8994093 - Flags: review?(sphink) → feedback?(jdemooij)
(Assignee)

Comment 6

8 months ago
(In reply to Steve Fink [:sfink] [:s:] from comment #5)
> Comment on attachment 8994093 [details] [diff] [review]
> Bug 1473213 - Add cells allocated statistics
> 
> Review of attachment 8994093 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> The biggest worry with this patch is that it adds code to the nursery
> allocation fast path. I would at least want some performance numbers that
> argue that it doesn't regress performance substantially. Though I'd also
> want to get jandem's gut feel as to whether this is acceptable.

Fair enough, I'll do some testing and NI jandem.

> ::: js/src/gc/Statistics.cpp
> @@ +1043,5 @@
> > +
> > +void
> > +Statistics::countAllocsSinceMinorGCNursery() {
> > +    allocsSinceMinorGCNursery_++;
> > +}
> 
> Hm. I probably would have called this Statistics::noteNurseryAllocation or
> similar. The "count" verb seems a bit deceptive, and also I'd rather let the
> stats object decide what to do with an allocation event.

There are other counters that use the verb count.  but I get your point, and I can change this one since it's a different pattern anyway, it's not kept in the array with the other counters.

> Also, this should absolutely be inlined in the header file. We really really
> do not want a function call for this.

Right, I should have done that.

> 
> @@ +1047,5 @@
> > +}
> > +void
> > +Statistics::countAllocsSinceMinorGCTenured() {
> > +    allocsSinceMinorGCTenured_++;
> > +}
> 
> noteTenuredAllocation?
> 
> ::: js/src/gc/Statistics.h
> @@ +324,5 @@
> > +     * These events cannot be kept in the above array, we need to take their
> > +     * address.
> > +     */
> > +    uint32_t allocsSinceMinorGCNursery_;
> > +    uint32_t allocsSinceMinorGCTenured_;
> 
> I thought we did tenured allocations on other threads, eg the parse thread,
> which means this would be a data race. Perhaps we're ok with the race, if
> we're going to handwave that it doesn't really matter if we get the value
> wrong, but that should at the very least be commented carefully. And I
> suspect it would be better to have a per-thread counter, or skip the
> increments on other threads.

I'm not all that okay with such a race, even if you and I understand that while looking at the gecko profiler.  Not everyone will be and unless we documented it in the profiler UI I'm not okay with it.

I'll investigate per-thread counters.

BTW. how is syncronisation done for the free lists?  Does each thread get a seperate free list (arena) and syncronize when moving to the next arena?

Cheers.
Flags: needinfo?(jdemooij)
(Assignee)

Comment 7

8 months ago
Oh, jandem already has a feedback? for this bug.  Removing the NI.
Flags: needinfo?(jdemooij)
(Assignee)

Comment 8

8 months ago
(In reply to Paul Bone [:pbone] from comment #6)
> BTW. how is syncronisation done for the free lists?  Does each thread get a
> seperate free list (arena) and syncronize when moving to the next arena?

NM, I see it in JSContext, and that JSContext is per-thread.
(In reply to Paul Bone [:pbone] from comment #8)
Free lists are per zone, and each zone is only accessed from one thread at a time.  The exception is the atoms zone which can have free lists per context.  JSContext::freeLists_ points at the free lists for the current zone.  If you store counters on the zone you will need to make GCRuntime::mergeRealm update the target zone counters when merging zones.
Comment on attachment 8994093 [details] [diff] [review]
Bug 1473213 - Add cells allocated statistics

Review of attachment 8994093 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Steve Fink [:sfink] [:s:] from comment #5)
> The biggest worry with this patch is that it adds code to the nursery
> allocation fast path. I would at least want some performance numbers that
> argue that it doesn't regress performance substantially. Though I'd also
> want to get jandem's gut feel as to whether this is acceptable.

These things are hard because a single add32 on x86 isn't going to *destroy* Firefox performance, but at the same time this *is* supposed to be a fast path and these things do add up. Death by a thousand cuts is a real concern when it comes to performance - Ehsan wrote about this in one of his QF newsletters and we fixed a ton of these issues last year.

Who do we expect will use this profiler feature? If it's SpiderMonkey (GC) hackers and their Gecko friends, we're adding overhead for millions of users, collecting data ~nobody will look at. At least for the JIT we have the option of only emitting instrumentation if the profiler is enabled, you could do something like this:

https://searchfox.org/mozilla-central/rev/d160ac574a5798010eda4e33e91ee628aa268a33/js/src/jit/CompileWrappers.cpp#319

The C++ code is probably less perf-sensitive if you make these functions inlinable, as suggested by sfink, so if we can't measure the perf overhead it's probably OK (I'd suggest also measuring some C++ string allocation micro-benchmarks with nursery strings enabled). We could also make this code Nightly-only if that's where it will be used 99% of the time.
Attachment #8994093 - Flags: feedback?(jdemooij)
(Assignee)

Comment 11

8 months ago
(In reply to Jon Coppeard (:jonco) from comment #9)
> (In reply to Paul Bone [:pbone] from comment #8)
> Free lists are per zone, and each zone is only accessed from one thread at a
> time.  The exception is the atoms zone which can have free lists per
> context.  JSContext::freeLists_ points at the free lists for the current
> zone.  If you store counters on the zone you will need to make
> GCRuntime::mergeRealm update the target zone counters when merging zones.

Ah, I found the other stuff but would have missed mergeRealm, thanks.
(Assignee)

Comment 12

8 months ago
(In reply to Jan de Mooij [:jandem] from comment #10)
> Comment on attachment 8994093 [details] [diff] [review]
> Bug 1473213 - Add cells allocated statistics
> 
> Review of attachment 8994093 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> (In reply to Steve Fink [:sfink] [:s:] from comment #5)
> > The biggest worry with this patch is that it adds code to the nursery
> > allocation fast path. I would at least want some performance numbers that
> > argue that it doesn't regress performance substantially. Though I'd also
> > want to get jandem's gut feel as to whether this is acceptable.
> 
> These things are hard because a single add32 on x86 isn't going to *destroy*
> Firefox performance, but at the same time this *is* supposed to be a fast
> path and these things do add up. Death by a thousand cuts is a real concern
> when it comes to performance - Ehsan wrote about this in one of his QF
> newsletters and we fixed a ton of these issues last year.
> 
> Who do we expect will use this profiler feature? If it's SpiderMonkey (GC)
> hackers and their Gecko friends, we're adding overhead for millions of
> users, collecting data ~nobody will look at. At least for the JIT we have
> the option of only emitting instrumentation if the profiler is enabled, you
> could do something like this:
> 
> https://searchfox.org/mozilla-central/rev/
> d160ac574a5798010eda4e33e91ee628aa268a33/js/src/jit/CompileWrappers.cpp#319

I hoped we'd have something like that.  I'll do that and only generate this code if the profiler is enabled.  If we need it for other situations we can worry about that later.

> The C++ code is probably less perf-sensitive if you make these functions
> inlinable, as suggested by sfink, so if we can't measure the perf overhead
> it's probably OK (I'd suggest also measuring some C++ string allocation
> micro-benchmarks with nursery strings enabled). We could also make this code
> Nightly-only if that's where it will be used 99% of the time.

I hope/expect it won't cost very much at all on these paths, I'll try to test it.
(Assignee)

Comment 13

8 months ago
Hi sfink.

I'm not 100% sure of the code in GCRuntime::minorGC.  I'm fairly sure none of the zones could be allocating at this point, because I think that anything that allocates off-thread does so with its own realm/compartiment.  Is that right?
Attachment #8994093 - Attachment is obsolete: true
Attachment #8994855 - Flags: review?(sphink)
(Assignee)

Comment 14

8 months ago
As discussed, only enable this counter for nightly builds.  And MOZ_PROFILING, which now that I think about it might not be the gecko profiler support as I had assumed for months.
Attachment #8994858 - Flags: review?(jcoppeard)
Comment on attachment 8994858 [details] [diff] [review]
Bug 1473213 - Only record allocation counts on Nightly builds

Review of attachment 8994858 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.

::: js/src/jit/MacroAssembler.cpp
@@ +877,5 @@
>          storePtr(temp, Address(result, NativeObject::offsetOfSlots()));
>      }
>  
> +#if defined(NIGHTLY_BUILD) && defined(MOZ_PROFILING)
> +    // Only burdeon the nightly population with this,

typo: 'burden', here and below
Attachment #8994858 - Flags: review?(jcoppeard) → review+
(Assignee)

Comment 16

8 months ago
There's no differences in performance acording to talos.  So I'm just waiting on sfink's review.
Comment on attachment 8994855 [details] [diff] [review]
Bug 1473213 - Add cells allocated statistics

Review of attachment 8994855 [details] [diff] [review]:
-----------------------------------------------------------------

I believe this is ok.

I probably would have preferred to control this with cx->runtime()->geckoProfiler().enabled() (stored in a JitCompilerOptions, as with jandem's link above), but I guess Nightly-only is good enough.

But for the JIT stuff, I'm going to feedback? jandem again with a question: this is an add32 of an absolute address, and it's unconditionally using a temp register for the address. I see

    inline void add32(Imm32 imm, const AbsoluteAddress& dest) DEFINED_ON(x86_shared);

How are you supposed to use that? I doubt it matters much at all since there's a temp register available, maybe a teeny tiny amount of code size, but if there's a Right Way of doing this then we probably ought to use it for the next cut & paste coder (like, me) who comes along. Or perhaps you're only intended to use that in x86/x86_64-specific code? (I'm guessing that's the case, which is why I'm marking this r+ and only f? jandem.)
Attachment #8994855 - Flags: review?(sphink)
Attachment #8994855 - Flags: review+
Attachment #8994855 - Flags: feedback?(jdemooij)
(Assignee)

Comment 19

8 months ago
(In reply to Steve Fink [:sfink] [:s:] from comment #18)
> Comment on attachment 8994855 [details] [diff] [review]
> Bug 1473213 - Add cells allocated statistics
> 
> Review of attachment 8994855 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I believe this is ok.
> 
> I probably would have preferred to control this with
> cx->runtime()->geckoProfiler().enabled() (stored in a JitCompilerOptions, as
> with jandem's link above), but I guess Nightly-only is good enough.

That was my initial plan.  I talked about it with Jon and we figured that if some code was JITed, then you enabled profiling, then the JITed code ran and didn't increment these counters, then you collected the profile.  You'd get misleading results.  Unless starting the profile were to throw out _all_ the JIT'ed code in which case you'd get a big jank or misleading profiles because code was recompiled or was running in the baseline interpreter rather than the JIT.

> But for the JIT stuff, I'm going to feedback? jandem again with a question:
> this is an add32 of an absolute address, and it's unconditionally using a
> temp register for the address. I see
> 
>     inline void add32(Imm32 imm, const AbsoluteAddress& dest)
> DEFINED_ON(x86_shared);
> 
> How are you supposed to use that? I doubt it matters much at all since
> there's a temp register available, maybe a teeny tiny amount of code size,
> but if there's a Right Way of doing this then we probably ought to use it
> for the next cut & paste coder (like, me) who comes along. Or perhaps you're
> only intended to use that in x86/x86_64-specific code? (I'm guessing that's
> the case, which is why I'm marking this r+ and only f? jandem.)

I tried to use a single instruction without the temp register.  But it only worked on x86, not on ARM or x86_64 where the address wouldn't fit in the instruction.  The address larger than 2^32 and won't fit in the displacement field of an x86_64 instruction.  I tried to special case it for each platform but found I could only do it for x86, so why bother?

I could see about using another value and an offset the way I see you have done with some string allocation.

I will take a quick look but also wait to see what jandem says.
Comment on attachment 8994855 [details] [diff] [review]
Bug 1473213 - Add cells allocated statistics

Review of attachment 8994855 [details] [diff] [review]:
-----------------------------------------------------------------

(As for the "not sure about" in the commit message, you can access all zones there because ZonesIter skips zones that are in use by helper threads. I'm not sure it's okay to access that on the atoms zone (the WithAtoms part), worth checking with jonco.)

When we enable the Gecko profiler, we ReleaseAllJITCode, so I think we could just take advantage of that:

https://searchfox.org/mozilla-central/rev/033d45ca70ff32acf04286244644d19308c359d5/js/src/vm/GeckoProfiler.cpp#103

I agree with sfink that add32 with an AbsoluteAddress is not very useful in shared code because it's so x86-specific.
Attachment #8994855 - Flags: feedback?(jdemooij)
(In reply to Jan de Mooij [:jandem] from comment #20)
> When we enable the Gecko profiler, we ReleaseAllJITCode, so I think we could
> just take advantage of that:

This may require some non-trivial MacroAssembler plumbing. Landing Nightly-only is fine with me, but let's keep an eye on AWFY/Talos just to make sure the overhead is negligible :)
(Assignee)

Comment 22

8 months ago
I was talking with sfink and jonco just now.  We also agree that landing nightly only, checking AWFY/talos will give us an idea of if this is a slowdown.  Then remove the nightly-only ifdef and enable it with the profiler in a future change.  Thanks Jandem for that pointer.
Depends on: 1479360
(Assignee)

Comment 23

8 months ago
I'm almost ready to commit this, I just landed Bug 1479360 and I want to let it settle for a couple of days and see if AWFY/Talos move due to it before I land this one.

Jonco, Do you know if I need a lock to read and update each zone in GCRuntime::minorGC?  Including the Atoms zone.  See jandem's review in comment 20.  Thanks.
Flags: needinfo?(jcoppeard)
(Assignee)

Updated

8 months ago
Blocks: 1480001
(Assignee)

Comment 24

8 months ago
The patches here can be improved with this change.
Attachment #8996925 - Flags: review?(jcoppeard)
Comment on attachment 8994855 [details] [diff] [review]
Bug 1473213 - Add cells allocated statistics

Review of attachment 8994855 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/Zone.h
@@ +341,5 @@
>      js::gc::ArenaLists arenas;
>  
> +  private:
> +    // Number of allocations since the most recent minor GC for this thread.
> +    uint32_t allocsSinceMinorGCTenured_;

(This name reads strangely to me.  Can we call this tenuredAllocsSinceMinorGC if that is what is means?  Ditto for method names below.)

Sadly this would require some kind of synchronisation for the atoms zone since we can concurrently allocate from that.

One possibility is to move the allocation count to the FreeLists structure.  There is one per zone, plus helper thread contexts can have their own for the atoms zone.  You can update that one without synchronisation.  You will need to merge the count for helper thread contexts back to the one on the zone which you can do by storing the count in the ParseTask at the end of HelperThread::handleParseWorkload and merging it in GlobalHelperThreadState::finishParseTask.  

Another is to make this count atomic, but then you'll have to work out how to update this from JIT code.  It seems we have MacroAssembler::atomicEffectOp that might do this but I'm not sure this works correctly on a C++ Atomic<>.
Flags: needinfo?(jcoppeard)
(Assignee)

Comment 26

8 months ago
(In reply to Jon Coppeard (:jonco) from comment #25)
> Comment on attachment 8994855 [details] [diff] [review]
> Bug 1473213 - Add cells allocated statistics
> 
> Review of attachment 8994855 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: js/src/gc/Zone.h
> @@ +341,5 @@
> >      js::gc::ArenaLists arenas;
> >  
> > +  private:
> > +    // Number of allocations since the most recent minor GC for this thread.
> > +    uint32_t allocsSinceMinorGCTenured_;
> 
> (This name reads strangely to me.  Can we call this
> tenuredAllocsSinceMinorGC if that is what is means?  Ditto for method names
> below.)
> 
> Sadly this would require some kind of synchronisation for the atoms zone
> since we can concurrently allocate from that.
> 
> One possibility is to move the allocation count to the FreeLists structure. 
> There is one per zone, plus helper thread contexts can have their own for
> the atoms zone.  You can update that one without synchronisation.  You will
> need to merge the count for helper thread contexts back to the one on the
> zone which you can do by storing the count in the ParseTask at the end of
> HelperThread::handleParseWorkload and merging it in
> GlobalHelperThreadState::finishParseTask.

It's a bit more tedious, but I'll investigate moving this allocation count like you suggest.

> Another is to make this count atomic, but then you'll have to work out how
> to update this from JIT code.  It seems we have
> MacroAssembler::atomicEffectOp that might do this but I'm not sure this
> works correctly on a C++ Atomic<>.

I'm concerned that this will be unnecessarily slow, most of the time it's not necessary and it overly-constrain the order of memory operations (at least on x86, whose atomic instructions and fences are rather "heavy").
(Assignee)

Comment 27

8 months ago
This refactoring patch will make the next patch make a little more sense.
Attachment #8998150 - Flags: review?(sphink)
(In reply to Paul Bone [:pbone] from comment #27)
> Created attachment 8998150 [details] [diff] [review]
> Bug 1473213 (Part 3) - Refactor JSContext code for changing zones

I skimmed this quickly because I'm working in this area: maybe this patch makes more sense together with the next patch, but just looking at this part I'm a bit worried the abstraction complicates things more than that it helps.
(Assignee)

Comment 29

8 months ago
Hi jandem,

What I want to acheive is making it so there's a single place where the zone_ pointer is changed, and therefore we can change the free lists and any other state there, it's easier to see that it's correct.

Maybe it's not worth the templatey-ness.

Cheers.
(Assignee)

Comment 30

8 months ago
Hi sfink,

I know you already reviewed this, but it's changed a bit and should be re-reviewed.  I addressed the concern with updating the information in the zone without protecting it somehow.  Now the "active" counter is kept in the JSContext and added to the one in the zone when the context switches away from that zone.  Each zone's counter is also now atomic.

Thanks.
Attachment #8994855 - Attachment is obsolete: true
Attachment #8996925 - Attachment is obsolete: true
Attachment #8996925 - Flags: review?(jcoppeard)
Attachment #8998178 - Flags: review?(sphink)
(Assignee)

Comment 31

8 months ago
(In reply to Paul Bone [:pbone] from comment #30)
> Created attachment 8998178 [details] [diff] [review]
> Bug 1473213 (Part 4) - Add cells allocated statistics r?sfink
> 
> Hi sfink,
> 
> I know you already reviewed this, but it's changed a bit and should be
> re-reviewed.  I addressed the concern with updating the information in the
> zone without protecting it somehow.  Now the "active" counter is kept in the
> JSContext and added to the one in the zone when the context switches away
> from that zone.  Each zone's counter is also now atomic.
> 
> Thanks.

Oh, I forgot to add.  Now what I'm unsure about is the condition in the JIT tenured alloc code.  I'll look into that tomorrow.
The problem was that sometimes the JSContext is null, so currently it doesn't increment anything then, I'll look for why it's null and decide if that's okay not to increment a counter.
(Assignee)

Comment 32

8 months ago
Hi jonco,

I know you reviewed this already.  but I've changed it to only ifdef out those parts that either increment a counter, or put the data into the gecko profiling data.

Cheers.
Attachment #8994858 - Attachment is obsolete: true
Attachment #8998190 - Flags: review?(jcoppeard)
Attachment #8998190 - Flags: review?(jcoppeard) → review+
Comment on attachment 8998150 [details] [diff] [review]
Bug 1473213 (Part 3) - Refactor JSContext code for changing zones

Review of attachment 8998150 [details] [diff] [review]:
-----------------------------------------------------------------

I'm ok with having a separate setZone, but the template stuff seems a little much for such a minor use. I think it'd probably be clearer to pass IsAtomsZone as a regular parameter. I'm pretty sure inlining will end up generating the same code.
Attachment #8998150 - Flags: review?(sphink) → review+
One issue with JSContext::allocsThisZoneSinceMinorGC_ is: what happens if we GC while we're in a Zone and allocsThisZoneSinceMinorGC_ > 0? Is it okay to count it as part of the *next* GC?

Also, can we make Zone::tenuredAllocsSinceMinorGC_ a Relaxed atomic? Requiring the most expensive memory barriers seems a bit much for this purpose.

(Sorry for being pedantic in this bug, but I'm still somewhat concerned about the complexity/perf vs benefits trade-offs here.)
Ah ok, I see the later patch now, adding the nightly check.
Last question before I'll shut up: why do we need the counter in the JSContext* if the one in the Zone is Atomic<> anyway? Worst case we'll lose a count here and there but that's fine for profiler stats.
(In reply to Jan de Mooij [:jandem] from comment #36)
> Last question before I'll shut up: why do we need the counter in the
> JSContext* if the one in the Zone is Atomic<> anyway? Worst case we'll lose
> a count here and there but that's fine for profiler stats.

I thought that was to avoid doing atomic increments all the time and be able to keep them to a thread-owned value on JSContext, only doing the atomic updates when switching zones. That also lets the jit code update a regular integer instead of an atomic one.
(In reply to Steve Fink [:sfink] [:s:] from comment #37)
> I thought that was to avoid doing atomic increments all the time and be able
> to keep them to a thread-owned value on JSContext, only doing the atomic
> updates when switching zones. That also lets the jit code update a regular
> integer instead of an atomic one.

I'm dumb, that's a good point. I still wonder if a Relaxed atomic is good enough for this purpose - JIT code already accesses Relaxed atomics anyway (JSContext::interruptBits_ is one).
Comment on attachment 8998178 [details] [diff] [review]
Bug 1473213 (Part 4) - Add cells allocated statistics r?sfink

Review of attachment 8998178 [details] [diff] [review]:
-----------------------------------------------------------------

I *think* this looks good now. Is there a bug yet to switch this to only do stuff (especially the JIT part) when profiling is active? I don't want that to get lost.

Also, it does seem like a Relaxed atomic would be appropriate here, even if zone switches aren't that common -- it's a good example for the next person to copy from when creating a counter. :-)

::: js/src/gc/Statistics.h
@@ +340,5 @@
> +     * These events cannot be kept in the above array, we need to take their
> +     * address.
> +     */
> +    uint32_t allocsSinceMinorGCNursery_;
> +    uint32_t allocsSinceMinorGCTenured_;

Hm, I still think these and the methods would be more readable as nurseryAllocsSinceMinorGC_ and tenuredAllocsSinceMinorGC_. If you want to group them, then

  struct {
    uint32_t nursery;
    uint32_t tenured;
  } allocsSinceMinorGC_;

would be fine too.
Attachment #8998178 - Flags: review?(sphink) → review+
(Assignee)

Comment 40

8 months ago
(In reply to Jan de Mooij [:jandem] from comment #34)
> One issue with JSContext::allocsThisZoneSinceMinorGC_ is: what happens if we
> GC while we're in a Zone and allocsThisZoneSinceMinorGC_ > 0? Is it okay to
> count it as part of the *next* GC?

Actually my head hit the pillow last night and then I rememberd this problem.  It's not really okay since the profiler will show you percentage of cells tenured, that won't make sense if we don't count the allocations properly..  it would be possible to say 200% of cells were tenured etc.

> Also, can we make Zone::tenuredAllocsSinceMinorGC_ a Relaxed atomic?
> Requiring the most expensive memory barriers seems a bit much for this
> purpose.

I wasn't worried because it didn't execute often, but Relaxed is a good idea here.  Reading it may miss some values but probably not enough to worry about.
(Assignee)

Comment 41

8 months ago
(In reply to Steve Fink [:sfink] [:s:] from comment #37)
> (In reply to Jan de Mooij [:jandem] from comment #36)
> > Last question before I'll shut up: why do we need the counter in the
> > JSContext* if the one in the Zone is Atomic<> anyway? Worst case we'll lose
> > a count here and there but that's fine for profiler stats.
> 
> I thought that was to avoid doing atomic increments all the time and be able
> to keep them to a thread-owned value on JSContext, only doing the atomic
> updates when switching zones. That also lets the jit code update a regular
> integer instead of an atomic one.

That's exactly right.  I don't want that overhead per-allocation.
(Assignee)

Comment 42

8 months ago
(In reply to Steve Fink [:sfink] [:s:] from comment #39)
> Comment on attachment 8998178 [details] [diff] [review]
> Bug 1473213 (Part 4) - Add cells allocated statistics r?sfink
> 
> Review of attachment 8998178 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I *think* this looks good now. Is there a bug yet to switch this to only do
> stuff (especially the JIT part) when profiling is active? I don't want that
> to get lost.

Already done Bug 1480001.

> Also, it does seem like a Relaxed atomic would be appropriate here, even if
> zone switches aren't that common -- it's a good example for the next person
> to copy from when creating a counter. :-)
> 
> ::: js/src/gc/Statistics.h
> @@ +340,5 @@
> > +     * These events cannot be kept in the above array, we need to take their
> > +     * address.
> > +     */
> > +    uint32_t allocsSinceMinorGCNursery_;
> > +    uint32_t allocsSinceMinorGCTenured_;
> 
> Hm, I still think these and the methods would be more readable as
> nurseryAllocsSinceMinorGC_ and tenuredAllocsSinceMinorGC_. If you want to
> group them, then
> 
>   struct {
>     uint32_t nursery;
>     uint32_t tenured;
>   } allocsSinceMinorGC_;
> 
> would be fine too.

I'll do that.  it was other symbols in a different class I renamed after your last review.

Comment 43

8 months ago
Pushed by pbone@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8a10fc261acb
(Part 1) - Track the number of tenured cells r=jonco
https://hg.mozilla.org/integration/mozilla-inbound/rev/e40d822d886b
(Part 2) - add a comment about an invariant required by the JIT r=sfink
https://hg.mozilla.org/integration/mozilla-inbound/rev/96e57dd562b8
(Part 3) - Refactor JSContext code for changing zones r=sfink
https://hg.mozilla.org/integration/mozilla-inbound/rev/129c9ab66f59
(Part 4) - Add cells allocated statistics r=sfink
https://hg.mozilla.org/integration/mozilla-inbound/rev/29f9f3ab9b41
(Part 5) - Only record allocation counts on Nightly builds r=jonco

Comment 44

8 months ago
Backout by btara@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8de32e539a2c
Backed out 5 changesets for devtools/client/debugger/new/test/mochitest/browser_dbg_rr_breakpoints-01.js failures
Comment on attachment 8998178 [details] [diff] [review]
Bug 1473213 (Part 4) - Add cells allocated statistics r?sfink

Review of attachment 8998178 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/MacroAssembler.cpp
@@ +832,5 @@
>  
>      bind(&success);
> +
> +    if (GetJitContext()->cx) {
> +        uint32_t* countAddress = GetJitContext()->cx->addressOfTenuredAllocCount();

I'm not sure this is correct.

The context's current zone when the JIT code is compiled is not necessarily the same as when the JIT code is executed.
(In reply to Jon Coppeard (:jonco) from comment #46)
> I'm not sure this is correct.
> 
> The context's current zone when the JIT code is compiled is not necessarily
> the same as when the JIT code is executed.

Nope, I'm wrong.  Jan says that JIT code can change realm to a different same-compartment realm, but not zone. 

So this works as is.
My question was more: would it make sense to use the Relaxed atomic on the zone *directly*, so we don't need the JSContext bookkeeping machinery? Worst-case two threads can race on the atoms zone, but it's just the atoms zone and it's fine for the counter to be off by a few. Am I missing something?
(In reply to Jon Coppeard (:jonco) from comment #46)
> > +    if (GetJitContext()->cx) {
> > +        uint32_t* countAddress = GetJitContext()->cx->addressOfTenuredAllocCount();
> 
> I'm not sure this is correct.

One unrelated problem with this is that cx is nullptr when compiling off-thread (should be the case for most Ion compilations), so we will not collect any data when running in Ion.

We could add a new method to CompileRuntime and then call that here, see also CompileRuntime::mainContextPtr().
(Assignee)

Comment 50

8 months ago
(In reply to Jan de Mooij [:jandem] from comment #48)
> My question was more: would it make sense to use the Relaxed atomic on the
> zone *directly*, so we don't need the JSContext bookkeeping machinery?
> Worst-case two threads can race on the atoms zone, but it's just the atoms
> zone and it's fine for the counter to be off by a few. Am I missing
> something?

I don't think so.

At least on x86 & x86-64 an uncontended atomic operation still has a larger cost than a non-atomic operation.  Although this is for tenured allocation which is not so much a "fast path" as nursery allocation.  I'm afraid I'm a little hand-wavy about this, since I don't know the specifics, if you do then I'd be happy to listen.
(Assignee)

Comment 51

8 months ago
(In reply to Jan de Mooij [:jandem] from comment #49)
> (In reply to Jon Coppeard (:jonco) from comment #46)
> > > +    if (GetJitContext()->cx) {
> > > +        uint32_t* countAddress = GetJitContext()->cx->addressOfTenuredAllocCount();
> > 
> > I'm not sure this is correct.
> 
> One unrelated problem with this is that cx is nullptr when compiling
> off-thread (should be the case for most Ion compilations), so we will not
> collect any data when running in Ion.
> 
> We could add a new method to CompileRuntime and then call that here, see
> also CompileRuntime::mainContextPtr().

I'd like to do that, would that be a problem for web workers?  I imagine they can't use the main context?
(In reply to Paul Bone [:pbone] from comment #51)
> I'd like to do that, would that be a problem for web workers?  I imagine
> they can't use the main context?

I think you got your answer on IRC, but just to put it on record: web workers have their own runtime and are not a problem here. We never compile code on the "main" runtime and run it on a worker; they're almost completely separate worlds.
(Assignee)

Updated

7 months ago
Depends on: 1482785
(Assignee)

Comment 53

7 months ago
(In reply to Steve Fink [:sfink] [:s:] from comment #52)
> (In reply to Paul Bone [:pbone] from comment #51)
> > I'd like to do that, would that be a problem for web workers?  I imagine
> > they can't use the main context?
> 
> I think you got your answer on IRC, but just to put it on record: web
> workers have their own runtime and are not a problem here. We never compile
> code on the "main" runtime and run it on a worker; they're almost completely
> separate worlds.

Oh, I'm sorry, I don't remember the anwser.  Okay, I'll just point at the main JSContext then.
No longer depends on: 1482785
(In reply to Paul Bone [:pbone] from comment #50)
> At least on x86 & x86-64 an uncontended atomic operation still has a larger
> cost than a non-atomic operation.  Although this is for tenured allocation
> which is not so much a "fast path" as nursery allocation.  I'm afraid I'm a
> little hand-wavy about this, since I don't know the specifics, if you do
> then I'd be happy to listen.

But it's *not* an atomic operation if you use the Relaxed policy.

Please just use the Zone counter and let's remove the one on the context, I think that removes a lot of complexity and then we also don't need to make realm/zone entering/leaving more complicated.
(Assignee)

Comment 55

7 months ago
(In reply to Jan de Mooij [:jandem] from comment #54)
> (In reply to Paul Bone [:pbone] from comment #50)
> > At least on x86 & x86-64 an uncontended atomic operation still has a larger
> > cost than a non-atomic operation.  Although this is for tenured allocation
> > which is not so much a "fast path" as nursery allocation.  I'm afraid I'm a
> > little hand-wavy about this, since I don't know the specifics, if you do
> > then I'd be happy to listen.
> 
> But it's *not* an atomic operation if you use the Relaxed policy.

It is, Relaxed is about ordering it with respect to other memory operations on the same thread.  It's still atomic, it just has no specific guanteed ordering with respect to other reads/writes.

I wanted to double check, so I compiled the same code with both Relaxed and SequentiallyConsistent and got exactly the same code generated on x86-64:

    lock add %esi,0x548(%rdi)

x86 has a fairly tight memory model and so there's no distinction, a relaxed operation is just as costly as a sequentially consistent one, and constrains the CPU WRT the order of its operations.

> Please just use the Zone counter and let's remove the one on the context, I
> think that removes a lot of complexity and then we also don't need to make
> realm/zone entering/leaving more complicated.

It'd be nice if that were simplier, however I think it's worth-while.  A relaxed atomic is the minimum we need for updating the zone counter, and by placing a per-context counter in each context we can avoid this expensive (and _more_ expensive on x86) atomic operation on the fairly-hot allocation path, and pay that const only when switching zones.

Comment 56

7 months ago
Pushed by pbone@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9022f602d146
(Part 1) - Track the number of tenured cells r=jonco
https://hg.mozilla.org/integration/mozilla-inbound/rev/8217b33680fa
(Part 2) - add a comment about an invariant required by the JIT r=sfink
https://hg.mozilla.org/integration/mozilla-inbound/rev/5f35c86a0d22
(Part 3) - Refactor JSContext code for changing zones r=sfink
https://hg.mozilla.org/integration/mozilla-inbound/rev/f3796f1a571b
(Part 4) - Add cells allocated statistics r=sfink
https://hg.mozilla.org/integration/mozilla-inbound/rev/e496e1a130b4
(Part 5) - Only record allocation counts on Nightly builds r=jonco
(In reply to Paul Bone [:pbone] from comment #55)
> I wanted to double check, so I compiled the same code with both Relaxed and
> SequentiallyConsistent and got exactly the same code generated on x86-64:
> 
>     lock add %esi,0x548(%rdi)

If you write it like this:

  counter = counter + 1

Then I get a simple "inc" instruction.

For SequentuallyConsistent, I get:

movl   0x82b5da(%rip), %eax
incl   %eax
xchgl  %eax, 0x82b5d2(%rip)

> x86 has a fairly tight memory model and so there's no distinction, a relaxed
> operation is just as costly as a sequentially consistent one, and constrains
> the CPU WRT the order of its operations.

That's not true. I've definitely seen big speedups on x86 from changing atomics from SequentiallyConsistent to ReleseAcquire or Relaxed.
Well, if you write it as "counter = counter + 1" then you should get a non-atomic "inc", with the risk of losing data - that inc is not interlocked, it performs the read and the write independently.  But if you have a relaxed atomic "a" and you do "a.inc()" then on x86 you really need to get "lock; inc".

For relaxed _atomic_ rmw operations on x86 there normally should be no significant perf difference from seq_cst because the only way the x86 knows how to implement these is using a bus lock.  For simple loads and stores, however, it's frequently the case that seq_cst loads and stores will need a fence (though not the lock prefix) while relaxed loads and stores will not; they will execute with the same speed as plain loads and stores.

So was this intended to be a potentially-lossy / best-effort operation or a precise count?
(Assignee)

Comment 59

7 months ago
(In reply to Jan de Mooij [:jandem] from comment #57)
> (In reply to Paul Bone [:pbone] from comment #55)
> > I wanted to double check, so I compiled the same code with both Relaxed and
> > SequentiallyConsistent and got exactly the same code generated on x86-64:
> > 
> >     lock add %esi,0x548(%rdi)
> 
> If you write it like this:
> 
>   counter = counter + 1
> 
> Then I get a simple "inc" instruction.
> 
> For SequentuallyConsistent, I get:
> 
> movl   0x82b5da(%rip), %eax
> incl   %eax
> xchgl  %eax, 0x82b5d2(%rip)

That's interesting, I hadn't seen that.

Also as Lars points out I'm concerned that counter = counter + 1 reads counter, then writes couter + 1 as seperate operations.

My code actualy does counter += x.  x can be any positive number.  And so I get the same locked add instruction that I showed earlier.

> > x86 has a fairly tight memory model and so there's no distinction, a relaxed
> > operation is just as costly as a sequentially consistent one, and constrains
> > the CPU WRT the order of its operations.
> 
> That's not true. I've definitely seen big speedups on x86 from changing
> atomics from SequentiallyConsistent to ReleseAcquire or Relaxed.

That's also interesting, I'm guessing it's because of moves that do/don't have fences like Lars said in the next comment.
(In reply to Paul Bone [:pbone] from comment #59)
> Also as Lars points out I'm concerned that counter = counter + 1 reads
> counter, then writes couter + 1 as seperate operations.

But that's fine because it's just for statistics and it only affects the atoms zone. Oh well.
(Assignee)

Comment 61

7 months ago
(In reply to Lars T Hansen [:lth] from comment #58)
> 
> So was this intended to be a potentially-lossy / best-effort operation or a
> precise count?

This is a precise count, but it doesn't have to be consistent with anything else.
You need to log in before you can comment on or make changes to this bug.