Fix the profiler's sleeping threads optimization

RESOLVED FIXED in Firefox 54

Status

()

Core
Gecko Profiler
RESOLVED FIXED
6 months ago
6 months ago

People

(Reporter: njn, Assigned: njn)

Tracking

Trunk
mozilla54
Points:
---

Firefox Tracking Flags

(firefox54 fixed)

Details

Attachments

(1 attachment)

(Assignee)

Description

6 months ago
Here's a profiler performance problem that Julian identified. If you start the browser and turn profiling on for all threads (by putting "," in the "Threads" box in the profiler settings) the following happens.

- It runs ok for a while, maybe a minute or so. On my Linux box, CPU usage is at about 30%.

- Then it hits an inflection point and the CPU usage jumps to 100%. |perf top| reveals that huge amounts of time are being spent in ProfileBuffer::FindLastSampleOfThread().

After some digging, I've determined that the problem is in the profiler's sleeping threads optimization, which works like so: when a thread being sampled is asleep and has been sampled at least once previously while asleep, the profiler will duplicate the previous sample (modulo timestamps). This is faster than taking a full sample. This is done by ProfileBuffer::DuplicateLastSample(), which uses ProfileBuffer::FindLastSampleOfThread() to find the previous sample.

Here's the problem. When ThreadInfo::FlushSamplesAndMarkers() is called, ProfileBuffer::reset() is called, which discards all previous samples. So the next time we sample, FindLastSampleOfThread() fails. This wouldn't be a problem, *except* that DuplicateLastSample() just swallows the failure. The sampling loop assumes that the duplication succeeded. Next time around the sampling loop, the same thing happens, except the time taken for FindLastSampleOfThread() to fail is longer because there is more data in the ProfileBuffer to fruitlessly search through. And so on.

So, this causes two problems:

- Missing samples for sleeping threads.

- Excessive CPU usage.

As far as I can tell, this bug has been present ever since the sleeping thread optimization landed in bug 963158, three years ago.

There are two possible fixes:

- Handle duplication failure in the sampling loop. This is straightforward.

- Remove the sleeping thread optimization.

I'm sorely tempted to do the latter, but it increases CPU usage on my Linux box when profiling all threads from ~30% to ~50%, so I've done the former.
(Assignee)

Comment 1

6 months ago
Created attachment 8843182 [details] [diff] [review]
Fix the profiler's sleeping threads optimization

When ProfilerBuffer::reset() is called, DuplicateLastSample() will start
failing for all sleeping threads because there will be no prior thread data in
the buffer to duplicate. But the sampling loop doesn't detect such failure.
This causes two problems:

- Missing samples.

- CPU usage goes through the roof, because each time around the sampling loop
  the length of the failing search increases.

The fix is simple: detect failure in the sampling loop and do a normal sample
in that case.

The patch also removes ThreadInfo::DuplicateLastSample(), because it just calls
onto ProfileBuffer::DuplicateLastSample().
Attachment #8843182 - Flags: review?(jseward)
(Assignee)

Updated

6 months ago
Blocks: 963158
Comment on attachment 8843182 [details] [diff] [review]
Fix the profiler's sleeping threads optimization

Review of attachment 8843182 [details] [diff] [review]:
-----------------------------------------------------------------

This is a nice fix.  For me, it appears to remove the inflection point
and the pathological behaviour that follows, in which we repeatedly
search through all samples and fail to find what we're looking for.

Even with the fix in place, FindLastSampleOfThread is still the hottest
function when the profiler is running, so there's scope for further
improvement, but not as part of this particular bug.

::: tools/profiler/core/platform-linux-android.cpp
@@ +364,5 @@
>          // This will be null if we're not interested in profiling this thread.
>          if (!info->HasProfile() || info->IsPendingDelete()) {
>            continue;
>          }
>  

It would be nice to have a short comment here explaining the rationale, given that this is the main sampling loop.  Perhaps:

// If the thread is asleep and has been sampled before in the same sleep
// episode, find and copy the previous sample, as that's cheaper than
// taking a new sample.

@@ +366,5 @@
>            continue;
>          }
>  
> +        if (info->Stack()->CanDuplicateLastSampleDueToSleep() &&
> +            gBuffer->DuplicateLastSample(info->ThreadId(), gStartTime)) {

Meh, complex side effects in conditional.  Not essential, but would be clearer imo thusly:

if (info->Stack()->CanDuplicateLastSampleDueToSleep()) {
   bool dup_ok = gBuffer->DuplicateLastSample(..);
   if (dup_ok)
      continue;
}
Attachment #8843182 - Flags: review?(jseward) → review+
(In reply to Julian Seward [:jseward] from comment #2)

> ::: tools/profiler/core/platform-linux-android.cpp
> @@ +364,5 @@
> >          // This will be null if we're not interested in profiling this thread.
> >          if (!info->HasProfile() || info->IsPendingDelete()) {
> >            continue;
> >          }
> >  
> 
> It would be nice to have a short comment here explaining the rationale,
> given that this is the main sampling loop.  Perhaps:

I meant for that review comment to be attached to the line

+        if (info->Stack()->CanDuplicateLastSampleDueToSleep() &&

I think I must have clicked on the wrong line in the review.
(Assignee)

Comment 4

6 months ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/4e196d802c7be7f3a3c147cc6c5b6406656584b1
Bug 1344118 - Fix the profiler's sleeping threads optimization. r=jseward.
(Assignee)

Comment 5

6 months ago
Thank you for the fast review.

I landed the patches without addressing the comments because bug 1342306 has two patches that modify the main sampling loop significantly and rebasing over that would be painful.

However, I created bug 1344169 and referred back to these comments from it, so they can be addressed after bug 1342306 lands.

Comment 6

6 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/4e196d802c7b
Status: ASSIGNED → RESOLVED
Last Resolved: 6 months ago
status-firefox54: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
You need to log in before you can comment on or make changes to this bug.