Reduce the amount of locking done by profiler_tracing() and profiler_feature_active()

RESOLVED FIXED in Firefox 55

Status

()

Core
Gecko Profiler
RESOLVED FIXED
4 months ago
27 days ago

People

(Reporter: Ehsan, Assigned: njn)

Tracking

(Blocks: 1 bug)

unspecified
mozilla55
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox55 fixed)

Details

Attachments

(4 attachments, 2 obsolete attachments)

See this profile from playing a youtube video in a single tab: <https://perfht.ml/2n7deyF>.

I think this amount of overhead for the profiler is really bad since it can skew the measurements.  Can we please think of some way to avoid these locks?  Thanks!
Flags: needinfo?(n.nethercote)
(Assignee)

Comment 1

4 months ago
The good news is that I had already half a plan to get rid of the extra locking that was added by bug 1342306. Looks like I will need to flehs it out into a whole plan :)

With careful use of Atomic<> I think we can make the following fields within ProfilerState accessible without the global lock: mIsActive, mFeature*, and mStartTime. Those will be enough to get rid of the vast majority of extra locking that bug 1342306 added.

(And if you're wondering why bug 1342306 added extra locking, it's because there was a ton of global state that was accessed with little or no synchronization, which was all very dubious. The code is much more obviously correct now.)
Assignee: nobody → n.nethercote
Flags: needinfo?(n.nethercote)
(Assignee)

Comment 2

4 months ago
> With careful use of Atomic<> I think we can make the following fields within
> ProfilerState accessible without the global lock: mIsActive, mFeature*, and
> mStartTime. Those will be enough to get rid of the vast majority of extra
> locking that bug 1342306 added.

That will work fine for mIsActive and mFeature*, which is enough to avoid the lock in profiler_feature_active(), which accounts for half of the locking overhead in the profile in comment 0.

But mStartTime is needed to avoid the lock in profiler_tracing(), which accounts for the other half of the overhead.
It's a mozilla::TimeStamp, which is not a type that can be made Atomic. (TimeStamp has a single TimeStampValue field. On Unix TimeStampValue contains a uint64_t field; on Windows TimeStampValue contains two uint64_t members and two bool members.)

Prior to bug 1342306, the sStartTime variable was simply handled in an entirely racy fashion -- written in profiler_init() and profiler_start() on the main thread; then read on arbitrary threads in profiler_tracing(), profiler_add_marker(), Tick(), and various other functions; all without any synchronization.

So at this point I'm uncertain how to deal with mStartTime.
Summary: Eliminate the cost of locking Gecko Profiler data structures → Reduce the amount of locking done by profiler_tracing() and profiler_feature_active()
(Assignee)

Comment 3

4 months ago
Created attachment 8847876 [details] [diff] [review]
(part 1) - Combine all profiler feature flags into a single int32_t

This will let us change it to Atomic<> and therefore be accessible without
gPSMutex being locked.
Attachment #8847876 - Flags: review?(mstange)
(Assignee)

Comment 4

4 months ago
Created attachment 8847884 [details] [diff] [review]
(part 2) - Allow some unlocked ProfilerState accesses

I wish this wasn't necessary :(

I think it's safe. Seems like the worse that could happen is we add a sample or
marker that we shouldn't, or fail to do one that we should.

Ehsan, this should fix about half of the overhead that you saw. Can you please
try it out and confirm if that's the case? Thank you.
Attachment #8847884 - Flags: review?(mstange)
Attachment #8847884 - Flags: feedback?(jseward)
Attachment #8847884 - Flags: feedback?(ehsan)
Ehsan, given that (IIUC) this is html5 video played directly by Gecko, I feel
like we should be able to reproduce this on all targets, not just Windows.  But
I can't see any trace of it on Linux.

Some STR questions, if I may:

* was this full screen or in a window?

* Do you have a specific video URL we can try?

* what build are you using?  If from source, debug or non-debug, opt or non-opt?

* what profiler settings are you using (sampling interval, buffer size, 
  # of threads being sampled)?

* any other possibly-relevant settings
Flags: needinfo?(ehsan)
(In reply to Julian Seward [:jseward] from comment #5)
> Ehsan, given that (IIUC) this is html5 video played directly by Gecko, I feel
> like we should be able to reproduce this on all targets, not just Windows. 
> But
> I can't see any trace of it on Linux.
> 
> Some STR questions, if I may:
> 
> * was this full screen or in a window?

This was a brand new profile on Windows, Nightly, after having gone to Youtube and clicking on a popular video.  The video wasn't full screen.

> * Do you have a specific video URL we can try?

It's in the profile.  :-) https://www.youtube.com/watch?v=_isYp06LPts

(FWIW this was a youtube suggestion, I'm not sure if the video is special.  We didn't try another video.)

> * what build are you using?  If from source, debug or non-debug, opt or
> non-opt?

Nightly from Moco.

> * what profiler settings are you using (sampling interval, buffer size, 
>   # of threads being sampled)?

All default.

> * any other possibly-relevant settings

Everything else in default settings.  Also on a fresh install of Windows.

Nick, sorry if my suggestion doesn't make much sense, I haven't worked on the profiler backend in a long time, but I was thinking about how to do this more easily, and I have an idea.  How about we make this pointer atomic: <http://searchfox.org/mozilla-central/rev/006005beff40d377cfd2f69d3400633c5ff09127/tools/profiler/public/GeckoProfiler.h#331> and enforce the following rules:

  * All ProfilerState objects are immutable.
  * ProfilerState objects are copyable.
  * ProfilerState objects are modifiable, where modification means making a copy and changing a specific field in the copy.

Then updating the profiler state would mean modifying the ProfilerState and making gPS point to the new object?
Flags: needinfo?(ehsan) → needinfo?(n.nethercote)
Also can you please make a Windows try build so I can try out the patch more easily?  Thanks!
I like Ehsan's suggestion about the immutable profiler state. But I wouldn't be surprised if Nicholas has a good explanation for why that's not possible.
(Assignee)

Comment 9

3 months ago
(In reply to Markus Stange [:mstange] from comment #8)
> I like Ehsan's suggestion about the immutable profiler state. But I wouldn't
> be surprised if Nicholas has a good explanation for why that's not possible.

ProfilerState represents global profiler state, such as:
- Is the profiler active? Is it paused?
- What features are on?
- What threads are being profiled?
- Info about all the live threads
- The sample buffer

Maybe I'm misunderstanding the proposal, but the idea of having multiple copies of any of that stuff makes zero sense to me.
Flags: needinfo?(n.nethercote)
(Assignee)

Comment 10

3 months ago
Looking again at the profile from comment 0, this locking is accounting for about 0.8% of the total profile, i.e. not that much. That might explain why Julian saw no sign of it on Linux. Is this level of overhead so bad?
(In reply to Nicholas Nethercote [:njn] from comment #9)
> Maybe I'm misunderstanding the proposal, [..]

I agree -- I also don't understand the proposal.  It sounds a bit like a
proposal to use read-copy-update idioms.  What problem would this solve?

I think we should not land any of the proposed patches until we have a proper
diagnosis of what's going on here.

My initial reaction to the profile in comment #0 is that it didn't show a lot
of locking overhead, but rather a lot of contention.  87.5% of the time goes
into RtlpEnterCriticalSectionContended, which is presumably the slow path that
defers to the kernel in the case that the fast path (atomic memory operation)
for acquiring the lock failed.  What we would call sys_futex on Linux, in
other words.

That's just me guessing that the Windows implementation of mutexes is
basically the same as the Linux implementation, though, and I don't know
whether that's true or not.
(In reply to Nicholas Nethercote [:njn] from comment #9)
> (In reply to Markus Stange [:mstange] from comment #8)
> > I like Ehsan's suggestion about the immutable profiler state. But I wouldn't
> > be surprised if Nicholas has a good explanation for why that's not possible.
> 
> ProfilerState represents global profiler state, such as:
> - Is the profiler active? Is it paused?
> - What features are on?
> - What threads are being profiled?

I think Ehsan's idea would work for the things listed above, but not so much for the things below:

> - Info about all the live threads
> - The sample buffer

(In reply to Julian Seward [:jseward] from comment #11)
> (In reply to Nicholas Nethercote [:njn] from comment #9)
> > Maybe I'm misunderstanding the proposal, [..]
> 
> I agree -- I also don't understand the proposal.  It sounds a bit like a
> proposal to use read-copy-update idioms.  What problem would this solve?

The current lock achieves the following:
 1. It makes sure that if you read multiple pieces of the state, these pieces will be consistent with each other.
 2. It makes sure that if you want to change parts of the state, you can do so without:
   2a. anybody else reading the state until you're done
   2b. anybody else changing the state until you're done.
 3. It makes sure that no two threads can read from the state at the same time.

As far as I'm concerned, 1 is necessary, 2a is arguably unnecessary, 2b is convenient, 3 is unfortunate.
Ehsan's proposal preserves 1, gets rid of 3, and does not address 2. For 2b it would probably require some kind of retry loop.


Anyway, I'm fine with going ahead with the current approach of the patches that are attached here.
> Ehsan's proposal preserves 1, gets rid of 3, and does not address 2. For 2b
> it would probably require some kind of retry loop.

Could we perhaps get read concurrency by using a reader-writer lock rather
than a mutex?
That could work.
(Assignee)

Comment 15

3 months ago
> > ProfilerState represents global profiler state, such as:
> > - Is the profiler active? Is it paused?
> > - What features are on?
> > - What threads are being profiled?
> 
> I think Ehsan's idea would work for the things listed above

How can it make sense to have multiple values for IsActive? I'm confused.


> Anyway, I'm fine with going ahead with the current approach of the patches that are attached here.

I would prefer not to go ahead with these patches, and a <1% overhead (comment 10) doesn't seem like it warrants the change.


> Could we perhaps get read concurrency by using a reader-writer lock rather than a mutex?

Aren't RWLocks most suitable when the number of reads greatly exceeds the number of writes? In this case that isn't true, because sampling requires writes and it runs at 1000Hz.
(In reply to Nicholas Nethercote [:njn] from comment #15)
> Aren't RWLocks most suitable when the number of reads greatly exceeds the
> number of writes? In this case that isn't true, because sampling requires
> writes and it runs at 1000Hz.

It's a different tradeoff.  They give read-side concurrency but at the extra
(development + maintenance) cost of having to prove that the no modifications to
shared state happen when the lock is held for reading.

I've yet to see in the Linux perf profiles, any evidence that we have
performance problems related to locking.  The costs appear to be elsewhere
(buffer management and unwinding).  That said, it would be a good idea to
profile the profiler on Windows, since otherwise we're just speculating that the
Linux results carry over.  I wonder if VTune could do this.
(In reply to Nicholas Nethercote [:njn] from comment #15)
> > > ProfilerState represents global profiler state, such as:
> > > - Is the profiler active? Is it paused?
> > > - What features are on?
> > > - What threads are being profiled?
> > 
> > I think Ehsan's idea would work for the things listed above
> 
> How can it make sense to have multiple values for IsActive? I'm confused.

Sorry for the vagueness of my idea...  I was hoping for it to not be taken at face value but to be used as a suggestion in places where it makes sense.

For example, like you suggest, I also don't think it makes any sense to have multiple values for IsActive.  But it also makes no sense to do any locking whatsoever for that, that boolean can simply be stored in an atomic global, if I understand the semantics involved correctly.

But I still think my suggestion may be useful for the list of enabled featured, and the list of profiled threads (speaking here as the least knowledgeable person on the profiler backend.)

> > Anyway, I'm fine with going ahead with the current approach of the patches that are attached here.
> 
> I would prefer not to go ahead with these patches, and a <1% overhead
> (comment 10) doesn't seem like it warrants the change.

1% is a *huge* overhead.  It's nothing if you're profiling watching a full youtube video, but it matters a lot if you're trying to for example evaluate a super hot code path.  I've been asking Markus for features like source line information in the hopes of making the Gecko Profiler useful for low level micro-optimization level profiling which I always have to drop down to a native profiler for these days, and these types of overheads would make the profiler unsuitable for those types of use cases.
(Assignee)

Comment 18

3 months ago
> For example, like you suggest, I also don't think it makes any sense to have
> multiple values for IsActive.  But it also makes no sense to do any locking
> whatsoever for that, that boolean can simply be stored in an atomic global,
> if I understand the semantics involved correctly.

Looking at IsActive in isolation is the easy case. The harder cases are where IsActive is consulted along with a couple of other pieces of the global state, and they all need to be consistent with each other. The atomicity needs to cover multiple values.

I now understand how this might work.

- There is a subset of ProfilerState that is only relevant when IsActive is true. Call it ActiveProfilerState.

- Furthermore, there is a subset of ActiveProfilerState that is immutable for any specific activation of the profiler (i.e. it's set by profiler_start() and doesn't change until the profiler is inactivated by profiler_stop()). Call it ImmutableActiveProfilerState. This includes mFeature* and mStartTime, which are used by the functions that show up in Ehsan's profile.

- In profiler_tracing, which needs to consult mIsActive and mFeaturePrivacy and mStartTime, we could atomically copy the ImmutableActiveProfilerState, and then query its members at leisure. That copy of ImmutableActiveProfilerState might become stale, but it's probably ok, because at worst we'd add a marker to the profiler that we arguably should not have.

- The tricky part is doing the atomic copy. ImmutableActiveProfilerState would have several fields, potentially including a couple of vectors and a TimeStamp, so you wouldn't want to actually copy it. An obvious alternative is to make it a refcounted type. But I'm not sure if our threadsafe refcount types provide strong enough properties to support this case. I think we'd need Atomic<RefPtr<ImmutableActiveProfilerState>>, and I don't think that's possible.
(In reply to Nicholas Nethercote [:njn] from comment #10)
> Looking again at the profile from comment 0, this locking is accounting for
> about 0.8% of the total profile, i.e. not that much. That might explain why
> Julian saw no sign of it on Linux. Is this level of overhead so bad?

I tested a Youtube video on Windows just now.  This is with bug 1345032 (further
speedups for FindLastSample) and bug 1344169 (the platform-* commoning up) in
place.  This is a non-debug, 32-bit Windows build at -O1, sampling all threads
at 1 KHz with a 9MB buffer.

For the GeckoMain thread, I see a cost of 1.4% for
RtlpEnterCriticalSectionContended, and 73% in NtWaitForMultipleObjects --
non-profiler-related waiting for other threads, I assume.

For a few other threads I looked at, RtlpEnterCriticalSectionContended doesn't
appear at all, and they are almost entirely parked up in
NtWaitForMultipleObjects or NtWaitForSingleObject.
(In reply to :Ehsan Akhgari (busy) from comment #17)

There's something I don't understand here.

> 1% is a *huge* overhead.  It's nothing if you're profiling watching a full
> youtube video, but it matters a lot if you're trying to for example evaluate
> a super hot code path.

Can you clarify what you mean by "evaluate"?  For a hot loop that uses say 90%
of the time, the profiler is going to produce the same call stack for 90% of the
samples, and so should accurately reflect the cost of that loop relative to the
rest of what's going on.  The loop will make absolute progress 1% slower because
of the profiler overheads, true.  Question is, in what sort of development
workflow does that 1% slowdown matter?
(Assignee)

Comment 21

3 months ago
> I tested a Youtube video on Windows just now.  This is with bug 1345032
> (further speedups for FindLastSample)

It's worth noting that the recent FindLastSampleOfThread improvements have been large. Bug 1344118 fixed a catastrophic performance cliff with it, and bug 1344258 and bug 1345032 got the constant factors down greatly, enough to reduce CPU usage by 10s of percentage points in common cases. But this don't show up in profiles because that code runs in the sampler thread.

In contrast, the extra locking caused by bug 1342306 does show up in profiles because the sample thread can see it.
(In reply to Julian Seward [:jseward] from comment #20)
> (In reply to :Ehsan Akhgari (busy) from comment #17)
> 
> There's something I don't understand here.
> 
> > 1% is a *huge* overhead.  It's nothing if you're profiling watching a full
> > youtube video, but it matters a lot if you're trying to for example evaluate
> > a super hot code path.
> 
> Can you clarify what you mean by "evaluate"?

Imperfect English.  I meant, trying to make sense of the performance characteristics of some code, not evaluate as in evaluate some code.  Sorry for the confusion.

>  For a hot loop that uses say
> 90%
> of the time, the profiler is going to produce the same call stack for 90% of
> the
> samples, and so should accurately reflect the cost of that loop relative to
> the
> rest of what's going on.  The loop will make absolute progress 1% slower
> because
> of the profiler overheads, true.  Question is, in what sort of development
> workflow does that 1% slowdown matter?

It's not just the 1% additional runtime that's added (we're not talking about just 1% extra computation, the added overhead here is locking, so we'll get context switches, etc.  I guess we can always use the native profilers for low level profiling like this (and this bug isn't the only case where we have Gecko Profiler specific overhead).  Although, comment 21 sounds hopeful, and I can always keep complaining if I see the locking overhead showing up in a profile in a way that actually blocks doing useful measurements.

I think we have more or less argued the theoretical aspects of this to death here. :-)
(Assignee)

Comment 23

3 months ago
Comment on attachment 8847876 [details] [diff] [review]
(part 1) - Combine all profiler feature flags into a single int32_t

I'm not keen on this patch.
Attachment #8847876 - Flags: review?(mstange)
(Assignee)

Updated

3 months ago
Attachment #8847884 - Flags: review?(mstange)
Attachment #8847884 - Flags: feedback?(jseward)
Attachment #8847884 - Flags: feedback?(ehsan)
The one place that really worries me here is the profiler_is_active() check in EventListenerManager::HandleEventInternal. When I added the PROFILER_LABEL_DYNAMIC there (in bug 1324941), smaug granted the review under the condition that profiler_is_active() is really cheap. Now that we're taking a lock in that function, I'm not sure that's really still true.
(Assignee)

Comment 25

3 months ago
> EventListenerManager::HandleEventInternal

Yes, the ad hoc profiling I've done in the past has shown that code to be quite hot. Bug 1351920 is a big help in the case where the profiler is active, because it reduces the number of Lock() calls from 3 to 2.
(In reply to Julian Seward [:jseward] from comment #11)
> My initial reaction to the profile in comment #0 is that it didn't show a lot
> of locking overhead, but rather a lot of contention.

My guess is that the contention is coming from the fact that the sampler thread grabs the lock during sampling. So that means while the sampler is sampling any thread, no thread can check whether the profiler is active or add a marker.

This is worrying because it really skews the runtime of these functions when the profiler is active vs when it's inactive.
(Assignee)

Updated

a month ago
Attachment #8847876 - Attachment is obsolete: true
(Assignee)

Updated

a month ago
Attachment #8847884 - Attachment is obsolete: true
(Assignee)

Comment 27

a month ago
Created attachment 8872524 [details] [diff] [review]
(part 1) - Rename some profiler parameters

The profiler is inconsistent about how it names some things, mostly parameters.
This patch fixes them as follows.

- A marker's identifier is now consistently |aMarkerName| (matching
  ProfilerMarker::mMarkerName), not |aMarker| or |aMarkerStr| or |aInfo|.

- Subclasses of ProfilerMarkerPayload are now consistently |payload|, not
  |marker|.

- The first string in a PseudoStack entry is now consistently |aLabel|
  (matching ProfileEntry::label_), not |aInfo|.
Attachment #8872524 - Flags: review?(mstange)
(Assignee)

Comment 28

a month ago
Created attachment 8872526 [details] [diff] [review]
(part 2) - Make CorePS::ProcessStartTime() lockless

It's immutable. This also means that profiler_time() is now lockless, though
that's a very rarely used function so it's not a performance win.
Attachment #8872526 - Flags: review?(mstange)
(Assignee)

Comment 29

a month ago
Created attachment 8872549 [details] [diff] [review]
(part 3) - Make some hot profiler functions lockless

The functions are profiler_is_active(), profiler_feature_active(),
profiler_tracing(), profiler_log(), profiler_add_marker().

This is achieved by adding RacyFeatures, which duplicates the features stored
in ActivePS in a way that can be accessed locklessly. There's a chance of
imprecision, but that doesn't matter for the way these functions are used; the
worst that can happen is we might very occasionally miss adding a marker, or
add one that we shouldn't.
Attachment #8872549 - Flags: review?(mstange)
Attachment #8872524 - Flags: review?(mstange) → review+
Attachment #8872526 - Flags: review?(mstange) → review+
Comment on attachment 8872549 [details] [diff] [review]
(part 3) - Make some hot profiler functions lockless

Review of attachment 8872549 [details] [diff] [review]:
-----------------------------------------------------------------

This is great. r+ with the RestyleTracker comment addressed. I don't insist on making my suggested changes to the "& Active" tests if you really prefer the current code.

::: layout/base/RestyleTracker.cpp
@@ +113,5 @@
>      if (uri) {
>        docURL = uri->GetSpecOrDefault();
>      }
> +    PROFILER_LABEL_DYNAMIC("RestyleTracker", "ProcessRestyles",
> +                           js::ProfileEntry::Category::CSS, docURL.get());

This label will be popped at the end of this scope. That's not what you want. Please move it back.

::: tools/profiler/core/platform.cpp
@@ +492,5 @@
> +  }
> +
> +  static void SetInactive() { sActiveAndFeatures = 0; }
> +
> +  static bool IsActive() { return uint32_t(sActiveAndFeatures) & Active; }

If "return sActiveAndFeatures != 0;" generates slightly faster code, then I'd prefer doing that, with an appropriate comment.

@@ +497,5 @@
> +
> +  static bool IsActiveWithFeature(uint32_t aFeature)
> +  {
> +    uint32_t af = sActiveAndFeatures;  // copy it first
> +    return (af & Active) && (af & aFeature);

Similarly here: We can assert that aFeature != 0 and then just check (af & aFeature).
If ProfilerFeature were an enum class, then we could have the aFeature parameter be of that enum type and wouldn't need the assert anymore.

If you do want to keep the & Active test, it would be nice to confirm whether this generates better code:
uint32_t bitsToCheck = Active | aFeature;
return (af & bitsToCheck) == bitsToCheck;

@@ +503,5 @@
> +
> +  static bool IsActiveWithoutPrivacy()
> +  {
> +    uint32_t af = sActiveAndFeatures;  // copy it first
> +    return (af & Active) && !(af & ProfilerFeature::Privacy);

and here: (af != 0) && !(af & ProfilerFeature::Privacy).
Attachment #8872549 - Flags: review?(mstange) → review+
(In reply to Markus Stange [:mstange] from comment #30)
> @@ +503,5 @@
> > +
> > +  static bool IsActiveWithoutPrivacy()
> > +  {
> > +    uint32_t af = sActiveAndFeatures;  // copy it first
> > +    return (af & Active) && !(af & ProfilerFeature::Privacy);
> 
> and here: (af != 0) && !(af & ProfilerFeature::Privacy).

This could also be:
> uint32_t bitsToCheck = Active | ProfilerFeature::Privacy;
> return (sActiveAndFeatures & bitsToCheck) == Active;

Sorry, I couldn't resist a bit of code golf. And I think that your code is clearer.
(Assignee)

Comment 32

29 days ago
We discussed the bit-fiddling suggestions on IRC. This code is hot enough that mutex locking can hurt performance, but not hot enough that bit-fiddling variations will affect performance. So I'll stick with the original code which is clearest.
(Assignee)

Comment 33

29 days ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/ad10343345034bf5f0f1c31e529af3c042427573
Bug 1347274 (part 1) - Rename some profiler parameters. r=mstange.

https://hg.mozilla.org/integration/mozilla-inbound/rev/79af4e3b4d718a368449a0a442c67ca3c1a7d3fd
Bug 1347274 (part 2) - Make CorePS::ProcessStartTime() lockless. r=mstange.

https://hg.mozilla.org/integration/mozilla-inbound/rev/8429360d3050f174202a1a8133489b6f7291348f
Bug 1347274 (part 3) - Make some hot profiler functions lockless. r=mstange.
(In reply to Markus Stange [:mstange] from comment #31)
> Sorry, I couldn't resist a bit of code golf.  [..]

"code golf" :-)  My new phrase for the day.

Unless the branches are unpredictable, I doubt you'd ever be able to measure a
difference between the variants.  Also, gcc/clang are expert code golfers and
may well transform between these and other variants, so it's pretty much moot
anyway.
Depends on: 1369004
sorry had to back this out for conflicts when merging m-c to m-i like:

(run 'hg heads' to see heads, 'hg merge' to merge)
merging layout/painting/nsDisplayList.cpp
merging tools/profiler/core/platform.cpp
warning: conflicts while merging tools/profiler/core/platform.cpp! (edit, then use 'hg resolve --mark')
299 files updated, 1 files merged, 87 files removed, 1 files unresolved
Flags: needinfo?(n.nethercote)

Comment 36

29 days ago
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4bd2d471693e
Backed out changeset 8429360d3050 
https://hg.mozilla.org/integration/mozilla-inbound/rev/f7a450e270f0
Backed out changeset 79af4e3b4d71 
https://hg.mozilla.org/integration/mozilla-inbound/rev/87c84b9fd8a4
Backed out changeset ad1034334503 for conflict merging m-c to m-i
as a note, when this landed, it caused many AWSY regressions:
== Change summary for alert #6971 (as of May 31 2017 05:05 UTC) ==

Regressions:

 87%  Heap Unclassified summary windows10-64-vm opt      44,776,383.33 -> 83,804,502.22
 83%  Heap Unclassified summary windows7-32-vm opt       39,672,191.03 -> 72,543,357.53
 56%  Heap Unclassified summary linux64 opt              55,110,244.56 -> 85,784,786.08
 28%  Explicit Memory summary windows7-32-vm opt         125,598,464.84 -> 160,306,294.02
 25%  Explicit Memory summary windows10-64-vm opt        163,059,854.78 -> 203,559,651.00
 19%  Explicit Memory summary linux64 opt                165,124,477.21 -> 195,901,142.94
  9%  Resident Memory summary windows10-64-vm opt        663,579,979.08 -> 723,807,557.14
  7%  Resident Memory summary linux64 opt                996,290,815.26 -> 1,065,474,102.23
  6%  Resident Memory summary windows7-32-vm opt         415,827,019.01 -> 439,865,829.80

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=6971
The backouts brought some memory usage improvements: https://treeherder.mozilla.org/perf.html#/alerts?id=6976

Were we leaking memory somehow?
Apparently so! There must be a serious bug somewhere in these patches.
Comment on attachment 8872549 [details] [diff] [review]
(part 3) - Make some hot profiler functions lockless

Review of attachment 8872549 [details] [diff] [review]:
-----------------------------------------------------------------

Found it!

::: tools/profiler/core/platform.cpp
@@ +2976,5 @@
>  
>    MOZ_RELEASE_ASSERT(CorePS::Exists());
>  
> +  // This function is hot enough that we use RacyFeatures, notActivePS.
> +  if (RacyFeatures::IsActiveWithoutPrivacy()) {

This check is missing a negation.
Turns out this also broke a couple Windows talos suites. Make sure to double-check those before re-landing too.

https://treeherder.mozilla.org/logviewer.html#?job_id=103322746&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=103321258&repo=mozilla-inbound
(Assignee)

Comment 42

28 days ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/85e6c77120d20c57254a722133bd7305e4839307
Bug 1347274 (part 1, attempt 2) - Rename some profiler parameters. r=mstange.

https://hg.mozilla.org/integration/mozilla-inbound/rev/9bf2eafa8defa9b0d69bfe8c757ca69826543e9b
Bug 1347274 (part 2, attempt 2) - Make CorePS::ProcessStartTime() lockless. r=mstange.
(Assignee)

Comment 43

28 days ago
> > +  // This function is hot enough that we use RacyFeatures, notActivePS.
> > +  if (RacyFeatures::IsActiveWithoutPrivacy()) {
> 
> This check is missing a negation.

Correct! Thank you for identifying.

This defect means that markers get added to the profiler's marker list when the profiler isn't running. And because the profiler isn't running, the main sampling loop isn't consuming the marker list. So the marker list grows without bound.

I can add a test for this by introducing a new subclass of ProfilerMarkerPayload, adding some of them with profiler_add_marker(), and then making sure they get destroyed as expected.
Flags: needinfo?(n.nethercote)
(Assignee)

Comment 44

28 days ago
Created attachment 8873288 [details] [diff] [review]
(part 4) - Improve the profiler's gtest

Two improvements.

- More testing of markers by introducing and using a ProfilerMarkerPayload
  subclass that keeps track of how many instances have been created, streamed,
  and destroyed. This change would have caught the defect that was in the first
  landing of part 3 of this bug.

- More testing of JSON output, to check for various substrings that are
  expected to be present.
Attachment #8873288 - Flags: review?(mstange)
Comment on attachment 8873288 [details] [diff] [review]
(part 4) - Improve the profiler's gtest

Review of attachment 8873288 [details] [diff] [review]:
-----------------------------------------------------------------

::: tools/profiler/tests/gtest/GeckoProfiler.cpp
@@ +351,5 @@
> +  // The number of GTestPayload instances that have been created, streamed, and
> +  // destroyed.
> +  static int mNumCreated;
> +  static int mNumStreamed;
> +  static int mNumDestroyed;

I think these statics usually start with an s, not an m.
Attachment #8873288 - Flags: review?(mstange) → review+

Comment 46

28 days ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/85e6c77120d2
https://hg.mozilla.org/mozilla-central/rev/9bf2eafa8def
Status: NEW → RESOLVED
Last Resolved: 28 days ago
status-firefox55: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
(Assignee)

Comment 47

27 days ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/90f080141b3d55ec2dde80ec2b03314f04838180
Bug 1347274 (part 3, attempt 2) - Make some hot profiler functions lockless. r=mstange.

https://hg.mozilla.org/integration/mozilla-inbound/rev/b0e196c8fcd770b9ce8be81ff0fbcffabc7b8808
Bug 1347274 (part 4) - Improve the profiler's gtest. r=mstange.
(Assignee)

Updated

27 days ago
Blocks: 1348402

Comment 48

27 days ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/90f080141b3d
https://hg.mozilla.org/mozilla-central/rev/b0e196c8fcd7
You need to log in before you can comment on or make changes to this bug.