Closed Bug 1382440 Opened 7 years ago Closed 7 years ago

Annotate BHR hangs that occur while system is under external load

Categories

(Core :: XPCOM, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla57
Tracking Status
firefox57 --- fixed

People

(Reporter: alexical, Assigned: alexical)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

If load on the user's machine is high, we can assume that any BHR data we get during that time is of low value, and we should give the UI the option of filtering it out.

It's hard to say what the watermark for this should be. Should we annotate if there isn't a free core, or if total usage outside of our current process is above some threshold like 50%?
I'm thinking of getting a start on this using performance counters to track system-wide CPU usage. However if anyone has any concerns with that approach let me know!
Assignee: nobody → dothayer
Status: NEW → ASSIGNED
Nathan, feel free to recommend this to someone else for review - I wasn't sure who to request. Also, I understand this will need to change a little once the work in Bug 1380081 is landed, but I figure the core of the work will remain intact, so I put it up for review now.
Comment on attachment 8892960 [details]
Bug 1382440 - Watch CPU usage in BHR

https://reviewboard.mozilla.org/r/162220/#review169328

This all seems pretty reasonable, some comments below.

::: xpcom/threads/BackgroundHangMonitor.cpp:321
(Diff revision 1)
>    // Default values for the first iteration of thread loop
>    PRIntervalTime waitTime = PR_INTERVAL_NO_WAIT;
>    PRIntervalTime recheckTimeout = PR_INTERVAL_NO_WAIT;
>  
>    while (!mShutdown) {
> +    Unused << mCPUUsageWatcher.CollectCPUUsage().unwrap();

We have no guarantees that this operation (or other operations) is error-free; a blind `unwrap` is not appropriate.

::: xpcom/threads/CPUUsageWatcher.h:31
(Diff revision 1)
> +class CPUUsageWatcher
> +  : public HangMonitor::Annotator

Adding a no-arg constructor that initializes all the members appropriately would be good.  C++ will give you one for free, but we prefer to be explicit when possible.  (Static analyses might complain too?)

::: xpcom/threads/CPUUsageWatcher.h:36
(Diff revision 1)
> +class CPUUsageWatcher
> +  : public HangMonitor::Annotator
> +{
> +public:
> +  Result<Ok, CPUUsageWatcherError>
> +  Startup();

Since `CPUUsageWatcher` is intended to be a member, not a global singleton, it's probably more appropriate to call this `Init`.

::: xpcom/threads/CPUUsageWatcher.h:38
(Diff revision 1)
> +  void
> +  Shutdown();

Nit: for declarations that fit on a single line, I think we typically put them on a single line, and only break them as done here if needed.  Likewise for all the other declarations in this class.

::: xpcom/threads/CPUUsageWatcher.cpp:7
(Diff revision 1)
> +#include "prsystem.h"
> +
> +#ifdef XP_MACOSX
> +#include <mach/mach_host.h>
> +#endif
> +
> +#ifdef XP_LINUX
> +#include <stdio.h>
> +#endif

Surely we should have an `#include "mozilla/CPUUsageWatcher.h"` somewhere in here?  (I realize it's found because of unified compilation, but we should try to not be too cavalier about these things.)

::: xpcom/threads/CPUUsageWatcher.cpp:56
(Diff revision 1)
> +  uint64_t usageTime;
> +  uint64_t updateTime;

These need some kind of documentation.

::: xpcom/threads/CPUUsageWatcher.cpp:68
(Diff revision 1)
> +GetProcessCPUStats() {
> +  CPUStats result = {};
> +  MOZ_TRY_VAR(result.usageTime, GetClockTime(CLOCK_PROCESS_CPUTIME_ID));
> +  MOZ_TRY_VAR(result.updateTime, GetClockTime(CLOCK_MONOTONIC));
> +  int32_t numCpus = PR_GetNumberOfProcessors();
> +  result.usageTime /= numCpus;

Why are we doing this?  Is the intent to try and measure how much time we might have consumed on a single CPU?

::: xpcom/threads/CPUUsageWatcher.cpp:190
(Diff revision 1)
> +    // We can't collect this information on Linux due to sandboxing and our
> +    // need to read from /proc/stat.

This comment seems inaccurate; we can't get global CPU stats, AIUI, but we can get process CPU stats?
Attachment #8892960 - Flags: review?(nfroyd)
Also, PR_GetNumberOfProcessors can fail, so some code needs to be added to handle that case.
Comment on attachment 8892960 [details]
Bug 1382440 - Watch CPU usage in BHR

https://reviewboard.mozilla.org/r/162220/#review169874

Thanks for the rapid update.  A few more higher-level comments and questions.

::: xpcom/threads/BackgroundHangMonitor.cpp:321
(Diff revision 2)
>    // Default values for the first iteration of thread loop
>    PRIntervalTime waitTime = PR_INTERVAL_NO_WAIT;
>    PRIntervalTime recheckTimeout = PR_INTERVAL_NO_WAIT;
>  
>    while (!mShutdown) {
> +    Unused << NS_WARN_IF(mCPUUsageWatcher.CollectCPUUsage().isErr());

Do you have any idea how often this executes, on average?  I realize it needs to update reasonably often, but I have no sense of how often we go through this loop.  If it's several times a second, or even every several seconds, we should find some other place to put it.

::: xpcom/threads/CPUUsageWatcher.cpp:73
(Diff revision 2)
> +  int32_t numCpus = PR_GetNumberOfProcessors();
> +  if (numCpus <= 0) {
> +    return Err(GetNumberOfProcessorsError);
> +  }

Could we get by with not determining the number of processesors on every call to these functions?  I realize the number of CPUs could (?) update dynamically on some some systems, but we've already resigned ourselves to that fact by fixing `mExternalUsageThreshold` in `CPUUsageWatcher::Init`.

::: xpcom/threads/CPUUsageWatcher.cpp:133
(Diff revision 2)
> +  result.usageTime = loadInfo.cpu_ticks[CPU_STATE_USER] +
> +                     loadInfo.cpu_ticks[CPU_STATE_NICE] +
> +                     loadInfo.cpu_ticks[CPU_STATE_SYSTEM];

The documentation for `usageTime` isn't really accurate for the global stats, perhaps we need a separate structure for global CPU time, or we need a more accurate documentation string for `usageTime`?

::: xpcom/threads/CPUUsageWatcher.cpp:144
(Diff revision 2)
> +  CPUStats result = {};
> +  FILE* f = fopen("/proc/stat", "r");

I am not super-excited about opening files every time we have to get global CPU stats; we've seen cases on Linux where touching files in `/proc` can result in very bad jank, and this code is liable to touch said files significantly more often.  Is there any other method we could use to get the information we want here?  Could we just disable this code on Linux entirely?
Attachment #8892960 - Flags: review?(nfroyd)
Comment on attachment 8892960 [details]
Bug 1382440 - Watch CPU usage in BHR

https://reviewboard.mozilla.org/r/162220/#review169874

> Could we get by with not determining the number of processesors on every call to these functions?  I realize the number of CPUs could (?) update dynamically on some some systems, but we've already resigned ourselves to that fact by fixing `mExternalUsageThreshold` in `CPUUsageWatcher::Init`.

Yeah that's fair. I'll cache it.

> The documentation for `usageTime` isn't really accurate for the global stats, perhaps we need a separate structure for global CPU time, or we need a more accurate documentation string for `usageTime`?

Yeah, I think just better documentation will suffice. Really we just need to ensure that usageTime is in the same units as updateTime, which in this particular case is total ticks across all cores.

> I am not super-excited about opening files every time we have to get global CPU stats; we've seen cases on Linux where touching files in `/proc` can result in very bad jank, and this code is liable to touch said files significantly more often.  Is there any other method we could use to get the information we want here?  Could we just disable this code on Linux entirely?

I couldn't find anything other than /proc/stat unfortunately. It's also what `top` uses internally. I'm fine with just disabling this on Linux.
Comment on attachment 8892960 [details]
Bug 1382440 - Watch CPU usage in BHR

https://reviewboard.mozilla.org/r/162220/#review169874

> Do you have any idea how often this executes, on average?  I realize it needs to update reasonably often, but I have no sense of how often we go through this loop.  If it's several times a second, or even every several seconds, we should find some other place to put it.

It does run fairly frequently. Maybe the best solution is to just move it into an nsITimer callback every second and change mExternalUsageRatio to be atomic. This would also make it easier to use the CPUUsageWatcher from other code (profiler?) without running another instance of it. In which case it should probably be static. Thoughts?
(In reply to Doug Thayer [:dthayer] from comment #9)
> Comment on attachment 8892960 [details]
> Bug 1382440 - Watch CPU usage in BHR
> 
> https://reviewboard.mozilla.org/r/162220/#review169874
> 
> > Do you have any idea how often this executes, on average?  I realize it needs to update reasonably often, but I have no sense of how often we go through this loop.  If it's several times a second, or even every several seconds, we should find some other place to put it.
> 
> It does run fairly frequently. Maybe the best solution is to just move it
> into an nsITimer callback every second and change mExternalUsageRatio to be
> atomic. This would also make it easier to use the CPUUsageWatcher from other
> code (profiler?) without running another instance of it. In which case it
> should probably be static. Thoughts?

We could do that.  I think running it every second is still pretty extreme though--we already have enough problems with battery life and timers firing off at random timers.
Or we could just modify the core loop of BHR to only update every so often, and we wouldn't have to bother with timers.  I'm not sure how much this code would get used by other consumers.
(In reply to Nathan Froyd [:froydnj] from comment #11)
> Or we could just modify the core loop of BHR to only update every so often,
> and we wouldn't have to bother with timers.  I'm not sure how much this code
> would get used by other consumers.

Well, there is Bug 1329600, but maybe that should use separate code.

By only update so often, do you mean run the BHR loops as frequently as we do now (since I believe that's necessary), but have it only update the CPUUsageWatcher occasionally? I'm fine with that - it's what the CPUUsageWatcher is doing internally, though with a bit more work than what's necessary. I'm happy to move that out into BHR. What would you say is a good update interval? If it's not running regularly though we'll want to make sure that we update it right when we detect a hang, so that if we wake up and immediately hang we're not using the mean CPU usage over the time that we were asleep.
Quick note on the above review: I had to move "BackgroundHangMonitor::Shutdown()" above "delete sMainHangMonitor" in order to unregister the annotator before the HangMonitor's destructor, which wants to ensure that all monitors are cleaned up before it runs. I couldn't see any reason why it wasn't okay to do so, but if you know of anything, let me know.
Comment on attachment 8892960 [details]
Bug 1382440 - Watch CPU usage in BHR

https://reviewboard.mozilla.org/r/162220/#review175500

Sorry for the delay; mozreview decided to not send me email about the review request. :(

r=me, but would like to discuss the below before this gets committed.

::: xpcom/threads/BackgroundHangMonitor.cpp:43
(Diff revisions 2 - 3)
> +// Interval at which we check the global and per-process CPU usage in order to determine
> +// if there is high external CPU usage.
> +static const int32_t kCheckCPUIntervalMilliseconds = 2000;

I would really prefer this to be a little higher, 5 or even 15 seconds would be better in my mind.

Is it possible to set it high, and then see what the data coming in from BHR looks like?  If we're getting a lot of reports that aren't useful when we use a high interval, we could try lowering it and see if that improves anything.  And if it doesn't improve anything, well, that's valuable information too.

::: xpcom/threads/CPUUsageWatcher.cpp:30
(Diff revisions 2 - 3)
> -static const uint64_t kFILETIMETicksPerSecond = 10000000;
> -static const uint64_t kCPUCheckInterval = kFILETIMETicksPerSecond / 2;
> +  // system, or averaged between them. Whichever it is, it just needs to
> +  // mirror what updateTime measures.

Maybe "...it needs to be expressed in the same units as updateTime"?  It's not measuring the same thing, after all, which is what "just needs to mirror what updateTime measures" suggests to me.
Attachment #8892960 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd [:froydnj] from comment #15)
> Is it possible to set it high, and then see what the data coming in from BHR
> looks like?  If we're getting a lot of reports that aren't useful when we
> use a high interval, we could try lowering it and see if that improves
> anything.  And if it doesn't improve anything, well, that's valuable
> information too.

I'm fine with setting it high and pulling it back depending on what we see, since the resolution _shouldn't_ be that important, and it's valuable information anyway if it turns out that resolution is important, but I just wanted to check in about what's prompting the reservations about this running every two seconds. On my machine the Windows side of things adds up to 2 microseconds, and this is in a background thread. At what point would it be cheap enough to run more frequently?

> Maybe "...it needs to be expressed in the same units as updateTime"?  It's
> not measuring the same thing, after all, which is what "just needs to mirror
> what updateTime measures" suggests to me.

Works for me!
Flags: needinfo?(nfroyd)
(In reply to Doug Thayer [:dthayer] from comment #16)
> (In reply to Nathan Froyd [:froydnj] from comment #15)
> > Is it possible to set it high, and then see what the data coming in from BHR
> > looks like?  If we're getting a lot of reports that aren't useful when we
> > use a high interval, we could try lowering it and see if that improves
> > anything.  And if it doesn't improve anything, well, that's valuable
> > information too.
> 
> I'm fine with setting it high and pulling it back depending on what we see,
> since the resolution _shouldn't_ be that important, and it's valuable
> information anyway if it turns out that resolution is important, but I just
> wanted to check in about what's prompting the reservations about this
> running every two seconds. On my machine the Windows side of things adds up
> to 2 microseconds, and this is in a background thread. At what point would
> it be cheap enough to run more frequently?

Ah, for some reason I was assuming that taking this measurement is waking us up out of some idle state, which would burn power, battery, etc.  I see that I was mistaken: we're running this when the background hang thread has woken up, so we're already doing "useful" work at that point.

Having the interval between measurements be two seconds should be fine, just so long as this isn't making the hang thread wake up more often than it is.
Flags: needinfo?(nfroyd)
Backout by kwierso@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/a117fbab8be5
Backed out changeset 528d2ea1c438 for build failures in CPUUsageWatcher.h a=backout
Sorry for the delay on this.

Here's the try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f738596accba3162b4e190f5776632e2558efeef
Flags: needinfo?(dothayer)
Comment on attachment 8904627 [details]
Bug 1382440 - Fix CPUUsageWatcher on OSX and Linux

https://reviewboard.mozilla.org/r/176446/#review181936

This seems like a little bit of a heavy hammer, but OK.
Attachment #8904627 - Flags: review?(nfroyd) → review+
Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b84af3674b49
Watch CPU usage in BHR r=froydnj
https://hg.mozilla.org/integration/autoland/rev/252f4499d372
Fix CPUUsageWatcher on OSX and Linux r=froydnj
https://hg.mozilla.org/mozilla-central/rev/b84af3674b49
https://hg.mozilla.org/mozilla-central/rev/252f4499d372
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Well, this had somewhat of a dismal showing in yesterday's telemetry:

defaultdict(int,
            {u'ExternalCPUHigh': 148,
             u'PendingInput': 1822033,
             u'UserInteracting': 1904376,
             u'pluginName': 2678,
             u'pluginVersion': 2678})

Trying to sort out if that's an accurate portrayal of the situation or not. Any time I compile code on my machine and check the BHR ping it shows up in all of the hangs, and I would expect that Nightly would have a higher population with this annotation given developers compiling code and doing other things while they wait, but I suppose not.
Scratch that - I thought I was using appBuildId in the script I had for this, but I was using submissionDate. Here's the correct numbers:

defaultdict(int,
            {u'ExternalCPUHigh': 76040,
             u'PendingInput': 874546,
             u'UserInteracting': 885618,
             u'pluginName': 871,
             u'pluginVersion': 871})
Depends on: 1530617
You need to log in before you can comment on or make changes to this bug.