Closed Bug 1372042 Opened 5 years ago Closed 5 years ago

Add telemetry about running collectors during idle time

Categories

(Core :: DOM: Core & HTML, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla56
Tracking Status
firefox56 --- fixed

People

(Reporter: smaug, Assigned: smaug)

References

Details

Attachments

(2 files, 2 obsolete files)

In order to tweak various timers and budgets around GC/CC running, it might be
useful to have telemetry about how often we end up using idle time and how often timers. In general we should aim for using idle time as much as possible, but there is always the risk then that we OOM too easily when running tests which don't have any idle time, so timers are needed too.
I think I want to see how many % of forgetSkippable, CC and GC (in those cases when nsJSEnvironment triggers GC) are run during idle time
Attached patch idle_telemetry.diff (obsolete) — Splinter Review
https://treeherder.mozilla.org/#/jobs?repo=try&revision=17ace12d9c7ea2f65a2368c2991bd0b6048e047c

Scalar handling is rather broken in e10s setup, so child processes need to send the data manually to parent.
local testing hints that things are working pretty much as expected. Parent process manages to schedule most (>90%) of collector stuff using idle time, and in child processes too (>80%) in case one is browsing "normal" pages. But when running testcases like speedometer, there is way less idle time and more collectors run at other time (idle usage somewhere between 20% and 40% in child process).
Comment on attachment 8876933 [details] [diff] [review]
idle_telemetry.diff

I think this is one value I'd like to collect: % of forgetSkippable/CC/GC done in idle time.
This gives % numbers for parent process and child processes separately.
Telemetry doesn't really have anything good enough here to deal with multiple child processes in this kind of case, so the values need to be passed to parent, so the setup is annoyingly complicated.
(but this is what chutten suggested).

If you think this is reasonable, I'll ask review from some telemetry peer too to add the probes.



Another possible telemetry probe would be the max collector pause time during non-idle, but separate bug for that.
Attachment #8876933 - Flags: review?(continuation)
Comment on attachment 8876933 [details] [diff] [review]
idle_telemetry.diff

Review of attachment 8876933 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/nsJSEnvironment.cpp
@@ +229,5 @@
>  // us from triggering expensive full collections too frequently.
>  static int32_t sExpensiveCollectorPokes = 0;
>  static const int32_t kPokesBetweenExpensiveCollectorTriggers = 5;
>  
> +static TimeDuration sCCDurationTotal;

Please define a two field struct or something and use that, instead of passing around these unboxed pairs manually.

@@ +235,5 @@
> +
> +static TimeDuration sFSDurationTotal;
> +static TimeDuration sFSDurationIdle;
> +
> +static TimeDuration sGCDurationTotal;

We don't have any other GC telemetry in this file, so it seems a little odd to do just this one thing. Please at least get an ok from a GC person to do that.

@@ +249,5 @@
> +{
> +  if (!aTotalDuration) {
> +    return;
> +  }
> +  if (XRE_GetProcessType() == GeckoProcessType_Default) {

XRE_IsParentProcess() please.

@@ +251,5 @@
> +    return;
> +  }
> +  if (XRE_GetProcessType() == GeckoProcessType_Default) {
> +    double idlePercent =
> +      aTotalIdleDuration.ToMilliseconds() / aTotalDuration.ToMilliseconds() *

This is all floating point anyways, so you could just use .ToSeconds() for both of these.

@@ +1487,5 @@
> +
> +  TimeDuration idleDuration;
> +  if (!aDeadline.IsNull()) {
> +    if (aDeadline < now) {
> +      // Slice did overflow the idle period.

I think this should be something like "This slice overflowed the idle period".

@@ +1583,5 @@
> +    sCCDurationTotal += duration;
> +    TimeDuration idleDuration;
> +    if (!mIdleDeadline.IsNull()) {
> +      if (mIdleDeadline < mEndSliceTime) {
> +        // Slice did overflow the idle period.

"This slice overflowed the idle period".

@@ +1645,5 @@
>  
>    // A file to dump CC activity to; set by MOZ_CCTIMER environment variable.
>    FILE* mFile;
> +
> +  // In case CC slice was triggered during idle time, set to the end of the idle

"case CC" --> "case the CC", "set to" --> "this is set to". Or something.

@@ +2039,5 @@
> +  sGCDurationTotal += sGCUnnotifiedTotalTime;
> +  sGCUnnotifiedTotalTime = TimeDuration();
> +
> +  TimeDuration idleDuration;
> +  if (!aDeadline.IsNull()) {

Please factor out these three almost identical blocks of code into a helper method.

@@ +2041,5 @@
> +
> +  TimeDuration idleDuration;
> +  if (!aDeadline.IsNull()) {
> +    if (aDeadline < now) {
> +      // Slice did overflow the idle period.

"This slice overflowed the idle period".

::: dom/ipc/ContentParent.cpp
@@ +5335,5 @@
> +                                                const double& aTotal)
> +{
> +  TimeDuration idle;
> +  TimeDuration total;
> +  switch (Telemetry::ScalarID(aId)) {

Is this behavior defined if aId is some weird value? I'm just trying to think of what happens if the child process intentionally sends us bad data.

@@ +5369,5 @@
> +  }
> +
> +  double idlePercent =
> +    idle.ToMilliseconds() / total.ToMilliseconds() * 100;
> +  Telemetry::ScalarSet(Telemetry::ScalarID(aId), uint32_t(idlePercent));

So we store a single value for the entirety of the browsing session? Don't we send telemetry pings once a day or something? So if you keep the browser open for a week, you'll send seven pings including overlapping data? That doesn't seem good to me. Also, is overflow a concern for the static variables?

::: toolkit/components/telemetry/Scalars.yaml
@@ +500,5 @@
> +    expires: never
> +    kind: uint
> +    notification_emails:
> +      - smaug@mozilla.com
> +    release_channel_collection: opt-in

Should these be opt-out? I'm not sure exactly what the criteria for that is.
Attachment #8876933 - Flags: review?(continuation) → review+
> We don't have any other GC telemetry in this file, so it seems a little odd
> to do just this one thing. Please at least get an ok from a GC person to do
> that.
Yeah, I was thinking that, but then this is all about idle handling, and that is in nsJSEnvironment.


> > +{
> > +  TimeDuration idle;
> > +  TimeDuration total;
> > +  switch (Telemetry::ScalarID(aId)) {
> 
> Is this behavior defined if aId is some weird value?
We crash the child process in default:

> > +  double idlePercent =
> > +    idle.ToMilliseconds() / total.ToMilliseconds() * 100;
> > +  Telemetry::ScalarSet(Telemetry::ScalarID(aId), uint32_t(idlePercent));
> 
> So we store a single value for the entirety of the browsing session? Don't
> we send telemetry pings once a day or something? So if you keep the browser
> open for a week, you'll send seven pings including overlapping data? That
> doesn't seem good to me.
No? Hmm, then I would need to clear the data somehow when telemetry ping is sent.
Let me ask telemetry folks if there is some way to clear the data.


> Also, is overflow a concern for the static
> variables?
I don't think so. If it was, we would have worse issues elsewhere when using TimeStamps and TimeDuration


> @@ +500,5 @@
> > +    expires: never
> > +    kind: uint
> > +    notification_emails:
> > +      - smaug@mozilla.com
> > +    release_channel_collection: opt-in
> 
> Should these be opt-out? I'm not sure exactly what the criteria for that is.
well, opt-in is weaker, so I chose that. I'm mostly interested in Nightly data anyhow, I think.
(In reply to Olli Pettay [:smaug] from comment #6)
> Yeah, I was thinking that, but then this is all about idle handling, and
> that is in nsJSEnvironment.

I guess that makes sense.

> > Is this behavior defined if aId is some weird value?
> We crash the child process in default:
Yeah, but I don't know whether it is defined behavior to cast from an int that does not correspond to a valid enum value. I don't know if that matters in practice.

> No? Hmm, then I would need to clear the data somehow when telemetry ping is
> sent.
> Let me ask telemetry folks if there is some way to clear the data.
I'd just ask a telemetry person what the right way to do this is. Maybe it is fine to keep accumulating.
Attached patch v2 (obsolete) — Splinter Review
Our telemetry APIs don't really support the previous approach if same data shouldn't be used after ping.
So, this is different approach using histograms and reporting % of idles per slice.
Attachment #8876933 - Attachment is obsolete: true
Attachment #8877697 - Flags: review?(continuation)
Comment on attachment 8877697 [details] [diff] [review]
v2

Review of attachment 8877697 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/nsJSEnvironment.cpp
@@ +1446,5 @@
>    sTotalForgetSkippableTime += delta;
>    sRemovedPurples += (aSuspected - sPreviousSuspectedCount);
>    ++sForgetSkippableBeforeCC;
> +
> +  TimeStamp now = TimeStamp::Now();

Please factor this code out instead of having the same thing copy and pasted.

@@ +1452,5 @@
> +  if (duration.ToSeconds()) {
> +    TimeDuration idleDuration;
> +    if (!aDeadline.IsNull()) {
> +      if (aDeadline < now) {
> +        // Slice did overflow the idle period.

This comment needs to be reworded, as per my previous review.

::: toolkit/components/telemetry/Histograms.json
@@ +873,5 @@
> +    "expires_in_version": "never",
> +    "kind": "linear",
> +    "high": 100,
> +    "n_buckets": 50,
> +    "description": "Percent of CC slice done during idle time"

please say "cycle collector" instead of CC.

@@ +938,5 @@
> +    "expires_in_version": "never",
> +    "kind": "linear",
> +    "high": 100,
> +    "n_buckets": 50,
> +    "description": "Percent of forget skippable done during idle time"

Maybe start this with "Percent of the cycle collector's forget skippable" so it gives a little more context about what forgetSkippable is.
Attachment #8877697 - Flags: review?(continuation) → review-
(In reply to Andrew McCreight [:mccr8] from comment #9)

> > +
> > +  TimeStamp now = TimeStamp::Now();
> 
> Please factor this code out instead of having the same thing copy and pasted.
It is really different each time. Factoring it out doesn't unfortunately help too much -
would need to pass several timestamps and durations
Attached patch v3Splinter Review
Didn't refactor to a helper method since especially because of GC handling the helper would need to take quite some params.
Attachment #8877697 - Attachment is obsolete: true
Attachment #8877715 - Flags: review?(continuation)
Comment on attachment 8877715 [details] [diff] [review]
v3

Is this ok from telemetry point of view (or should bsmedberg or someone else review this?)
Attachment #8877715 - Flags: review?(chutten)
Comment on attachment 8877715 [details] [diff] [review]
v3

Review of attachment 8877715 [details] [diff] [review]:
-----------------------------------------------------------------

You'll need a Data Collection review as per https://wiki.mozilla.org/Firefox/Data_Collection

But it's also good to ask a telemetry client dev (like me, :gfritzsche, :Dexter) to give your approach a once-over.

And your approach seems straightforward and correct. A couple of nits/questions, but I don't expect your use of Telemetry to be problematic as written (so long as these stats collections aren't _too_ hot)

::: dom/base/nsJSEnvironment.cpp
@@ +1538,5 @@
>      mAnyLockedOut = false;
>      mExtraForgetSkippableCalls = 0;
>    }
>  
> +  void PrepareForCycleCollectionSlice(TimeStamp aDeadline = TimeStamp(),

You use mIdleDeadline instead of aDeadline in this member function

::: toolkit/components/telemetry/Histograms.json
@@ +931,5 @@
>      "high": 10000,
>      "n_buckets": 50,
>      "description": "Max time spent on one forget skippable (ms)"
>    },
> +  "FORGET_SKIPPABLE_DURING_IDLE": {

Should this be prefixed CYCLE_COLLECTOR_?

@@ +1209,5 @@
>    },
> +  "GC_SLICE_DURING_IDLE": {
> +    "record_in_processes": ["main", "content"],
> +    "alert_emails": ["dev-telemetry-gc-alerts@mozilla.org"],
> +    "expires_in_version": "never",

The Data Collection process will likely have some things to say about making new measures permanent without a trial period. It has become a pattern these days to expires_in_version 5-6 versions ahead of where it will land and file a bug to evaluate its long-term usefulness and decide its permanence at 4-5 versions from now.
Attachment #8877715 - Flags: review?(chutten) → review+
(In reply to Chris H-C :chutten from comment #13)
> You'll need a Data Collection review as per
> https://wiki.mozilla.org/Firefox/Data_Collection
ok

> >  
> > +  void PrepareForCycleCollectionSlice(TimeStamp aDeadline = TimeStamp(),
> 
> You use mIdleDeadline instead of aDeadline in this member function
No idea what this means.
PrepareForCycleCollectionSlice clearly sets 
mIdleDeadline = aDeadline;


> > +  "FORGET_SKIPPABLE_DURING_IDLE": {
> 
> Should this be prefixed CYCLE_COLLECTOR_?
nope. This is about forget skippable, which has some other probes too
Comment on attachment 8877715 [details] [diff] [review]
v3

bsmedmerg it is then.
Attachment #8877715 - Flags: review?(benjamin)
(In reply to Olli Pettay [:smaug] from comment #14)
> > > +  void PrepareForCycleCollectionSlice(TimeStamp aDeadline = TimeStamp(),
> > 
> > You use mIdleDeadline instead of aDeadline in this member function
> No idea what this means.
> PrepareForCycleCollectionSlice clearly sets 
> mIdleDeadline = aDeadline;

Oh bother, I misread this as the function definition instead of a forward decl. Completely my stupidity, feel free to ignore.
Attachment #8877715 - Flags: review?(continuation) → review+
Comment on attachment 8877715 [details] [diff] [review]
v3

>diff --git a/toolkit/components/telemetry/Histograms.json b/toolkit/components/telemetry/Histograms.json

>+  "CYCLE_COLLECTOR_SLICE_DURING_IDLE": {
>+    "record_in_processes": ["main", "content"],
>+    "alert_emails": ["dev-telemetry-gc-alerts@mozilla.org"],
>+    "expires_in_version": "never",
>+    "kind": "linear",
>+    "high": 100,
>+    "n_buckets": 50,
>+    "description": "Percent of cycle collector slice done during idle time"
>+  },

Yeah, chutten had good instincts about the expiry. If this is going to be collected forever, we need a clear plan for how it's going to be permanently monitored (and that should probably be hooked up to mission control monitoring). Otherwise let's start with 6 months/5 releases.

Also, why isn't this opt-out? If this is important enough to measure forever, it's probably important enough to monitor for release users, since prerelease is rather atypical.

Same questions below. data-review=me with an expiration of 62 and with or without opt-out. Please get back to me about a monitoring plan if you really want to land this now with no expiry.
Attachment #8877715 - Flags: review?(continuation)
Attachment #8877715 - Flags: review?(benjamin)
Attachment #8877715 - Flags: review+
Comment on attachment 8877715 [details] [diff] [review]
v3

whoops midair
Attachment #8877715 - Flags: review?(continuation) → review+
Attachment #8877715 - Flags: review+
expiration 62 sounds ok for now.
...since this is very new stuff and perhaps we'll want to collect data in some different way.
Pushed by opettay@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0b0175f6dd54
Add telemetry about running collectors during idle time, r=mccr8
https://hg.mozilla.org/mozilla-central/rev/0b0175f6dd54
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Depends on: 1373651
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.