Closed
Bug 1382440
Opened 7 years ago
Closed 7 years ago
Annotate BHR hangs that occur while system is under external load
Categories
(Core :: XPCOM, enhancement)
Core
XPCOM
Tracking
()
RESOLVED
FIXED
mozilla57
Tracking | Status | |
---|---|---|
firefox57 | --- | fixed |
People
(Reporter: alexical, Assigned: alexical)
References
(Blocks 1 open bug)
Details
Attachments
(2 files)
If load on the user's machine is high, we can assume that any BHR data we get during that time is of low value, and we should give the UI the option of filtering it out. It's hard to say what the watermark for this should be. Should we annotate if there isn't a free core, or if total usage outside of our current process is above some threshold like 50%?
Assignee | ||
Comment 1•7 years ago
|
||
I'm thinking of getting a start on this using performance counters to track system-wide CPU usage. However if anyone has any concerns with that approach let me know!
Assignee: nobody → dothayer
Status: NEW → ASSIGNED
Comment hidden (mozreview-request) |
Assignee | ||
Comment 3•7 years ago
|
||
Nathan, feel free to recommend this to someone else for review - I wasn't sure who to request. Also, I understand this will need to change a little once the work in Bug 1380081 is landed, but I figure the core of the work will remain intact, so I put it up for review now.
Comment 4•7 years ago
|
||
mozreview-review |
Comment on attachment 8892960 [details] Bug 1382440 - Watch CPU usage in BHR https://reviewboard.mozilla.org/r/162220/#review169328 This all seems pretty reasonable, some comments below. ::: xpcom/threads/BackgroundHangMonitor.cpp:321 (Diff revision 1) > // Default values for the first iteration of thread loop > PRIntervalTime waitTime = PR_INTERVAL_NO_WAIT; > PRIntervalTime recheckTimeout = PR_INTERVAL_NO_WAIT; > > while (!mShutdown) { > + Unused << mCPUUsageWatcher.CollectCPUUsage().unwrap(); We have no guarantees that this operation (or other operations) is error-free; a blind `unwrap` is not appropriate. ::: xpcom/threads/CPUUsageWatcher.h:31 (Diff revision 1) > +class CPUUsageWatcher > + : public HangMonitor::Annotator Adding a no-arg constructor that initializes all the members appropriately would be good. C++ will give you one for free, but we prefer to be explicit when possible. (Static analyses might complain too?) ::: xpcom/threads/CPUUsageWatcher.h:36 (Diff revision 1) > +class CPUUsageWatcher > + : public HangMonitor::Annotator > +{ > +public: > + Result<Ok, CPUUsageWatcherError> > + Startup(); Since `CPUUsageWatcher` is intended to be a member, not a global singleton, it's probably more appropriate to call this `Init`. ::: xpcom/threads/CPUUsageWatcher.h:38 (Diff revision 1) > + void > + Shutdown(); Nit: for declarations that fit on a single line, I think we typically put them on a single line, and only break them as done here if needed. Likewise for all the other declarations in this class. ::: xpcom/threads/CPUUsageWatcher.cpp:7 (Diff revision 1) > +#include "prsystem.h" > + > +#ifdef XP_MACOSX > +#include <mach/mach_host.h> > +#endif > + > +#ifdef XP_LINUX > +#include <stdio.h> > +#endif Surely we should have an `#include "mozilla/CPUUsageWatcher.h"` somewhere in here? (I realize it's found because of unified compilation, but we should try to not be too cavalier about these things.) ::: xpcom/threads/CPUUsageWatcher.cpp:56 (Diff revision 1) > + uint64_t usageTime; > + uint64_t updateTime; These need some kind of documentation. ::: xpcom/threads/CPUUsageWatcher.cpp:68 (Diff revision 1) > +GetProcessCPUStats() { > + CPUStats result = {}; > + MOZ_TRY_VAR(result.usageTime, GetClockTime(CLOCK_PROCESS_CPUTIME_ID)); > + MOZ_TRY_VAR(result.updateTime, GetClockTime(CLOCK_MONOTONIC)); > + int32_t numCpus = PR_GetNumberOfProcessors(); > + result.usageTime /= numCpus; Why are we doing this? Is the intent to try and measure how much time we might have consumed on a single CPU? ::: xpcom/threads/CPUUsageWatcher.cpp:190 (Diff revision 1) > + // We can't collect this information on Linux due to sandboxing and our > + // need to read from /proc/stat. This comment seems inaccurate; we can't get global CPU stats, AIUI, but we can get process CPU stats?
Attachment #8892960 -
Flags: review?(nfroyd)
Comment 5•7 years ago
|
||
Also, PR_GetNumberOfProcessors can fail, so some code needs to be added to handle that case.
Comment hidden (mozreview-request) |
Comment 7•7 years ago
|
||
mozreview-review |
Comment on attachment 8892960 [details] Bug 1382440 - Watch CPU usage in BHR https://reviewboard.mozilla.org/r/162220/#review169874 Thanks for the rapid update. A few more higher-level comments and questions. ::: xpcom/threads/BackgroundHangMonitor.cpp:321 (Diff revision 2) > // Default values for the first iteration of thread loop > PRIntervalTime waitTime = PR_INTERVAL_NO_WAIT; > PRIntervalTime recheckTimeout = PR_INTERVAL_NO_WAIT; > > while (!mShutdown) { > + Unused << NS_WARN_IF(mCPUUsageWatcher.CollectCPUUsage().isErr()); Do you have any idea how often this executes, on average? I realize it needs to update reasonably often, but I have no sense of how often we go through this loop. If it's several times a second, or even every several seconds, we should find some other place to put it. ::: xpcom/threads/CPUUsageWatcher.cpp:73 (Diff revision 2) > + int32_t numCpus = PR_GetNumberOfProcessors(); > + if (numCpus <= 0) { > + return Err(GetNumberOfProcessorsError); > + } Could we get by with not determining the number of processesors on every call to these functions? I realize the number of CPUs could (?) update dynamically on some some systems, but we've already resigned ourselves to that fact by fixing `mExternalUsageThreshold` in `CPUUsageWatcher::Init`. ::: xpcom/threads/CPUUsageWatcher.cpp:133 (Diff revision 2) > + result.usageTime = loadInfo.cpu_ticks[CPU_STATE_USER] + > + loadInfo.cpu_ticks[CPU_STATE_NICE] + > + loadInfo.cpu_ticks[CPU_STATE_SYSTEM]; The documentation for `usageTime` isn't really accurate for the global stats, perhaps we need a separate structure for global CPU time, or we need a more accurate documentation string for `usageTime`? ::: xpcom/threads/CPUUsageWatcher.cpp:144 (Diff revision 2) > + CPUStats result = {}; > + FILE* f = fopen("/proc/stat", "r"); I am not super-excited about opening files every time we have to get global CPU stats; we've seen cases on Linux where touching files in `/proc` can result in very bad jank, and this code is liable to touch said files significantly more often. Is there any other method we could use to get the information we want here? Could we just disable this code on Linux entirely?
Attachment #8892960 -
Flags: review?(nfroyd)
Assignee | ||
Comment 8•7 years ago
|
||
mozreview-review-reply |
Comment on attachment 8892960 [details] Bug 1382440 - Watch CPU usage in BHR https://reviewboard.mozilla.org/r/162220/#review169874 > Could we get by with not determining the number of processesors on every call to these functions? I realize the number of CPUs could (?) update dynamically on some some systems, but we've already resigned ourselves to that fact by fixing `mExternalUsageThreshold` in `CPUUsageWatcher::Init`. Yeah that's fair. I'll cache it. > The documentation for `usageTime` isn't really accurate for the global stats, perhaps we need a separate structure for global CPU time, or we need a more accurate documentation string for `usageTime`? Yeah, I think just better documentation will suffice. Really we just need to ensure that usageTime is in the same units as updateTime, which in this particular case is total ticks across all cores. > I am not super-excited about opening files every time we have to get global CPU stats; we've seen cases on Linux where touching files in `/proc` can result in very bad jank, and this code is liable to touch said files significantly more often. Is there any other method we could use to get the information we want here? Could we just disable this code on Linux entirely? I couldn't find anything other than /proc/stat unfortunately. It's also what `top` uses internally. I'm fine with just disabling this on Linux.
Assignee | ||
Comment 9•7 years ago
|
||
mozreview-review-reply |
Comment on attachment 8892960 [details] Bug 1382440 - Watch CPU usage in BHR https://reviewboard.mozilla.org/r/162220/#review169874 > Do you have any idea how often this executes, on average? I realize it needs to update reasonably often, but I have no sense of how often we go through this loop. If it's several times a second, or even every several seconds, we should find some other place to put it. It does run fairly frequently. Maybe the best solution is to just move it into an nsITimer callback every second and change mExternalUsageRatio to be atomic. This would also make it easier to use the CPUUsageWatcher from other code (profiler?) without running another instance of it. In which case it should probably be static. Thoughts?
Comment 10•7 years ago
|
||
(In reply to Doug Thayer [:dthayer] from comment #9) > Comment on attachment 8892960 [details] > Bug 1382440 - Watch CPU usage in BHR > > https://reviewboard.mozilla.org/r/162220/#review169874 > > > Do you have any idea how often this executes, on average? I realize it needs to update reasonably often, but I have no sense of how often we go through this loop. If it's several times a second, or even every several seconds, we should find some other place to put it. > > It does run fairly frequently. Maybe the best solution is to just move it > into an nsITimer callback every second and change mExternalUsageRatio to be > atomic. This would also make it easier to use the CPUUsageWatcher from other > code (profiler?) without running another instance of it. In which case it > should probably be static. Thoughts? We could do that. I think running it every second is still pretty extreme though--we already have enough problems with battery life and timers firing off at random timers.
Comment 11•7 years ago
|
||
Or we could just modify the core loop of BHR to only update every so often, and we wouldn't have to bother with timers. I'm not sure how much this code would get used by other consumers.
Assignee | ||
Comment 12•7 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #11) > Or we could just modify the core loop of BHR to only update every so often, > and we wouldn't have to bother with timers. I'm not sure how much this code > would get used by other consumers. Well, there is Bug 1329600, but maybe that should use separate code. By only update so often, do you mean run the BHR loops as frequently as we do now (since I believe that's necessary), but have it only update the CPUUsageWatcher occasionally? I'm fine with that - it's what the CPUUsageWatcher is doing internally, though with a bit more work than what's necessary. I'm happy to move that out into BHR. What would you say is a good update interval? If it's not running regularly though we'll want to make sure that we update it right when we detect a hang, so that if we wake up and immediately hang we're not using the mean CPU usage over the time that we were asleep.
Comment hidden (mozreview-request) |
Assignee | ||
Comment 14•7 years ago
|
||
Quick note on the above review: I had to move "BackgroundHangMonitor::Shutdown()" above "delete sMainHangMonitor" in order to unregister the annotator before the HangMonitor's destructor, which wants to ensure that all monitors are cleaned up before it runs. I couldn't see any reason why it wasn't okay to do so, but if you know of anything, let me know.
Comment 15•7 years ago
|
||
mozreview-review |
Comment on attachment 8892960 [details] Bug 1382440 - Watch CPU usage in BHR https://reviewboard.mozilla.org/r/162220/#review175500 Sorry for the delay; mozreview decided to not send me email about the review request. :( r=me, but would like to discuss the below before this gets committed. ::: xpcom/threads/BackgroundHangMonitor.cpp:43 (Diff revisions 2 - 3) > +// Interval at which we check the global and per-process CPU usage in order to determine > +// if there is high external CPU usage. > +static const int32_t kCheckCPUIntervalMilliseconds = 2000; I would really prefer this to be a little higher, 5 or even 15 seconds would be better in my mind. Is it possible to set it high, and then see what the data coming in from BHR looks like? If we're getting a lot of reports that aren't useful when we use a high interval, we could try lowering it and see if that improves anything. And if it doesn't improve anything, well, that's valuable information too. ::: xpcom/threads/CPUUsageWatcher.cpp:30 (Diff revisions 2 - 3) > -static const uint64_t kFILETIMETicksPerSecond = 10000000; > -static const uint64_t kCPUCheckInterval = kFILETIMETicksPerSecond / 2; > + // system, or averaged between them. Whichever it is, it just needs to > + // mirror what updateTime measures. Maybe "...it needs to be expressed in the same units as updateTime"? It's not measuring the same thing, after all, which is what "just needs to mirror what updateTime measures" suggests to me.
Attachment #8892960 -
Flags: review?(nfroyd) → review+
Assignee | ||
Comment 16•7 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #15) > Is it possible to set it high, and then see what the data coming in from BHR > looks like? If we're getting a lot of reports that aren't useful when we > use a high interval, we could try lowering it and see if that improves > anything. And if it doesn't improve anything, well, that's valuable > information too. I'm fine with setting it high and pulling it back depending on what we see, since the resolution _shouldn't_ be that important, and it's valuable information anyway if it turns out that resolution is important, but I just wanted to check in about what's prompting the reservations about this running every two seconds. On my machine the Windows side of things adds up to 2 microseconds, and this is in a background thread. At what point would it be cheap enough to run more frequently? > Maybe "...it needs to be expressed in the same units as updateTime"? It's > not measuring the same thing, after all, which is what "just needs to mirror > what updateTime measures" suggests to me. Works for me!
Flags: needinfo?(nfroyd)
Comment 17•7 years ago
|
||
(In reply to Doug Thayer [:dthayer] from comment #16) > (In reply to Nathan Froyd [:froydnj] from comment #15) > > Is it possible to set it high, and then see what the data coming in from BHR > > looks like? If we're getting a lot of reports that aren't useful when we > > use a high interval, we could try lowering it and see if that improves > > anything. And if it doesn't improve anything, well, that's valuable > > information too. > > I'm fine with setting it high and pulling it back depending on what we see, > since the resolution _shouldn't_ be that important, and it's valuable > information anyway if it turns out that resolution is important, but I just > wanted to check in about what's prompting the reservations about this > running every two seconds. On my machine the Windows side of things adds up > to 2 microseconds, and this is in a background thread. At what point would > it be cheap enough to run more frequently? Ah, for some reason I was assuming that taking this measurement is waking us up out of some idle state, which would burn power, battery, etc. I see that I was mistaken: we're running this when the background hang thread has woken up, so we're already doing "useful" work at that point. Having the interval between measurements be two seconds should be fine, just so long as this isn't making the hang thread wake up more often than it is.
Flags: needinfo?(nfroyd)
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment 20•7 years ago
|
||
Pushed by dothayer@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/528d2ea1c438 Watch CPU usage in BHR r=froydnj
Backed out for failures like https://treeherder.mozilla.org/logviewer.html#?job_id=125699978&repo=autoland
Flags: needinfo?(dothayer)
Comment 22•7 years ago
|
||
Backout by kwierso@gmail.com: https://hg.mozilla.org/integration/autoland/rev/a117fbab8be5 Backed out changeset 528d2ea1c438 for build failures in CPUUsageWatcher.h a=backout
Comment 23•7 years ago
|
||
Also blew up Android quite spectacularly. https://treeherder.mozilla.org/logviewer.html#?job_id=125708820&repo=autoland
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Comment 26•7 years ago
|
||
Sorry for the delay on this. Here's the try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f738596accba3162b4e190f5776632e2558efeef
Flags: needinfo?(dothayer)
Comment 27•7 years ago
|
||
mozreview-review |
Comment on attachment 8904627 [details] Bug 1382440 - Fix CPUUsageWatcher on OSX and Linux https://reviewboard.mozilla.org/r/176446/#review181936 This seems like a little bit of a heavy hammer, but OK.
Attachment #8904627 -
Flags: review?(nfroyd) → review+
Comment 28•7 years ago
|
||
Pushed by dothayer@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b84af3674b49 Watch CPU usage in BHR r=froydnj https://hg.mozilla.org/integration/autoland/rev/252f4499d372 Fix CPUUsageWatcher on OSX and Linux r=froydnj
Comment 29•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/b84af3674b49 https://hg.mozilla.org/mozilla-central/rev/252f4499d372
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
status-firefox57:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Assignee | ||
Comment 30•7 years ago
|
||
Well, this had somewhat of a dismal showing in yesterday's telemetry: defaultdict(int, {u'ExternalCPUHigh': 148, u'PendingInput': 1822033, u'UserInteracting': 1904376, u'pluginName': 2678, u'pluginVersion': 2678}) Trying to sort out if that's an accurate portrayal of the situation or not. Any time I compile code on my machine and check the BHR ping it shows up in all of the hangs, and I would expect that Nightly would have a higher population with this annotation given developers compiling code and doing other things while they wait, but I suppose not.
Assignee | ||
Comment 31•7 years ago
|
||
Scratch that - I thought I was using appBuildId in the script I had for this, but I was using submissionDate. Here's the correct numbers: defaultdict(int, {u'ExternalCPUHigh': 76040, u'PendingInput': 874546, u'UserInteracting': 885618, u'pluginName': 871, u'pluginVersion': 871})
You need to log in
before you can comment on or make changes to this bug.
Description
•