Closed Bug 1308040 Opened 8 years ago Closed 6 years ago

Better GC telemetry

Categories: Core :: JavaScript: GC, defect
Importance: priority not set, severity normal
Status: RESOLVED FIXED
People: Reporter: billm, Assignee: billm
References: Blocks 1 open bug
Keywords: leave-open

Attachments (7 files, 3 obsolete files):
- 1.45 KB patch (gfritzsche: review+, jonco: review+)
- 1.35 KB patch (sfink: review+)
- 2.11 KB patch (sfink: review+)
- 787 bytes patch (sfink: review+)
- 23.03 KB patch (gfritzsche: review+)
- 23.28 KB patch (benjamin: review+)
- 42.98 KB patch (mreid: review+)
We collect all sorts of data about how long GCs take, but we report almost none of it via telemetry. I would like to fix that. It would help in bug 1308039 to figure out where we should yield to painting. It will also help more generally.
Benjamin, do you have any concerns here? My thinking is that we would collect data at the level of individual GCs. The data might look like this:

{ "when": <time>, "reason": "CC_WAITING", "phases": { "mark": 22, "sweep": 12 }, "heapsize": 250 }

The numbers in "phases" are times in milliseconds. There would be a lot more phases than what I've shown here.

We wouldn't send every GC since there are too many. Probably we would send the top 5 worst GCs or something like that.

Processing would have to happen through IPython notebooks since the format doesn't match our current dashboards.
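
As a rough sketch of the "worst GCs" selection (assuming each GC record carries a max_pause field, as the later patches do):

  // Keep only the N GCs with the largest maximum pause this session.
  const MAX_WORST = 5;
  let worst = [];

  function recordGC(data) {
    worst.push(data);
    worst.sort((a, b) => b.max_pause - a.max_pause);
    if (worst.length > MAX_WORST)
      worst.length = MAX_WORST; // drop the mildest GCs
  }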
Flags: needinfo?(benjamin)
Depends on: 1308116
How will this be different from the many GC histograms that already exist (GC_*_MS)?  Would these new measures supplement or replace the old measures? And which person/team is responsible for all these metrics long-term?

In terms of data stewardship/docs, I have no problem with this plan in general. I would like to see your plan for actually using this data in a little more detail. Do you need data engineering assistance to make this into dashboards or get proper alerting? I would like to review the final docs.

There is the separate question of whether this is too much data for the pipeline (or how much it will cost). That is a conversation you should have with gfritzsche and mreid.
Flags: needinfo?(benjamin)
The main thing we lack right now is correlation data between the different measurements. For example, we have a measurement of how long the longest slice takes, but we don't know what phases were run in that slice. We separately measure how long each phase takes, but it's not broken down by slice. We could come up with histograms for all these things, but there would be too many. It's a lot easier to see exactly what happened in one bad GC.
This patch ensures that we generate JSON data for all GCs if extended telemetry is enabled. Is this the right way to do this, Georg? If the user has disabled telemetry, I want to make sure we don't notify here.
Attachment #8802745 - Flags: review?(jcoppeard)
Attachment #8802745 - Flags: review?(gfritzsche)
This patch really cuts down on the size of the JSON we save. Steve, I don't know of anything that uses the JSON output right now. Will this break anything?
Attachment #8802746 - Flags: review?(sphink)
Attached patch Save GCs in telemetry (obsolete) — Splinter Review
This saves a selected set of GCs in telemetry. Hopefully the docs should explain it pretty well.
Attachment #8802747 - Flags: review?(gfritzsche)
Comment on attachment 8802746 [details] [diff] [review]
Don't output 0s in JSON

Review of attachment 8802746 [details] [diff] [review]:
-----------------------------------------------------------------

I don't know of anything that would break, and it might be better to fix anything that breaks since it tends to be a lot easier to manually read the output if the noise is suppressed. The devtools stuff *might* be consuming the JSON; I'll ni? fitzgen (who may not be the right person, but can forward the request). But I wouldn't object to landing it and seeing if anything breaks.
Attachment #8802746 - Flags: review?(sphink) → review+
fitzgen - see comment 7
Flags: needinfo?(nfitzgerald)
Devtools is using JS::dbg::GarbageCollectionEvent, not this JSON.
Flags: needinfo?(nfitzgerald)
Attachment #8802745 - Flags: review?(jcoppeard) → review+
Comment on attachment 8802747 [details] [diff] [review]
Save GCs in telemetry

Review of attachment 8802747 [details] [diff] [review]:
-----------------------------------------------------------------

Is this a temporary measurement or something we want to collect long-term?
This looks like it will add quite a bit to the payload size, so we should have a size estimate (per ping, raw size, pessimistic/upper bound) for our budget estimates and tracking.

Note that the way this is implemented, the data will only be available in "saved-session" pings and "main" pings sent on "shutdown", with the resulting latency restrictions.

In general we would like to avoid adding more custom JSON blobs where possible, for multiple reasons (code maintenance, common client logic & storage, common pipeline-side validation & tooling, ...).
Could we record this data now in scalars (if we prioritize solving bug 1278556)?

This looks like it could map well to keyed scalars (e.g. keyed scalars like "gc.stats.max_pause" with keys like "random0", "worst0").
http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/collection/scalars.html
That way, you could always directly record the statistics in Telemetry and Telemetry would take care of sending it with the "main" pings whenever needed.

I don't want to block work here if we can't collect this data in scalars now, but would like to find out what's missing to do this.
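
For illustration, recording could look like this (assuming the "gc.stats.max_pause" name above were registered in Scalars.yaml; keyed scalar values are unsigned ints, so pause times would need rounding):

  function recordMaxPause(kind, index, maxPauseMs) {
    // Keys like "worst0" or "random1" identify the sampled GC.
    Services.telemetry.keyedScalarSet("gc.stats.max_pause",
                                      kind + index,
                                      Math.round(maxPauseMs));
  }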

::: toolkit/components/telemetry/GCTelemetry.jsm
@@ +54,5 @@
> +    // practice.
> +    for (let i = 0; i < this.randomlySelected.length; i++) {
> +      let r = Math.random();
> +      if (r <= prob) {
> +        this.randomlySelected[i] = data;

This would always overwrite the random samples from the latest data.
I don't know how the "garbage-collection-statistics" work - would that be a problem and not sample uniformly over the session (e.g. bias towards high activity cycles close to shutdown)?

::: toolkit/components/telemetry/docs/data/main-ping.rst
@@ +387,5 @@
> +      "random": [
> +        {
> +          // All times are in milliseconds. Absolute times are in
> +          // milliseconds since the epoch.
> +          "timestamp": 1476905859874063,

Is it useful to have absolute timestamps?
Timestamps relative to session/browser start seem sufficient and easier to work with.

@@ +411,5 @@
> +              // GC state when the slice started
> +              "initial_state": "NotActive",
> +              // GC state when the slice ended
> +              "final_state": "Mark",
> +              "budget": "10ms",

Does this need to include a unit?
Would it be easier to always have them in ms/ns/...?

::: toolkit/components/telemetry/tests/browser/browser_TelemetryGC.js
@@ +1,5 @@
> +"use strict";
> +
> +const {GCTelemetry} = Cu.import("resource://gre/modules/GCTelemetry.jsm", {});
> +
> +add_task(function* test() {

Is there a reason why this is a browser test instead of xpcshell?
Comment on attachment 8802745 [details] [diff] [review]
Notify JSON if extended telemetry enabled

Review of attachment 8802745 [details] [diff] [review]:
-----------------------------------------------------------------

This looks fine.
Note that extended Telemetry is on by default from Nightly through Beta as well, in case the impact there is relevant.
Attachment #8802745 - Flags: review?(gfritzsche) → review+
(In reply to Georg Fritzsche [:gfritzsche] from comment #10)
> Comment on attachment 8802747 [details] [diff] [review]
> Save GCs in telemetry
> 
> Review of attachment 8802747 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Is this a temporary measurement or something we want to collect long-term?

I suspect we'll want to collect this long term. GC is something we're always going to be interested in.

> This looks like it will add quite a bit to the payload size, so we should
> have a size estimate (per ping, raw size, pessimistic/upper bound) for our
> budget estimates and tracking.

I'll try to come up with something. Perhaps it would be best to limit the size of the data.

> Note that the way this is implemented, the data will only be available in
> "saved-session" pings and "main" pings sent on "shutdown", with the
> resulting latency restrictions.

Understood. I'm mainly interested in data collected for the child process, and I didn't see any mechanism to send up non-histogram subsession data from the child.

> In general we would like to avoid adding more custom JSON blobs where
> possible, for multiple reasons (code maintenance, common client logic &
> storage, common pipeline-side validation & tooling, ...).
> Could we record this data now in scalars (if we prioritize solving bug
> 1278556)?
> 
> This looks like it could map well to keyed scalars (e.g. keyed scalars like
> "gc.stats.max_pause" with keys like "random0", "worst0").
> http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/
> collection/scalars.html
> That way, you could always directly record the statistics in Telemetry and
> Telemetry would take care of sending it with the "main" pings whenever
> needed.

I don't think that would work. It's very important to be able to see data for distinct GCs. That way you can see that, for example, GCs whose first slice was triggered by CC_WAITING took longer than those where the first slice reason was PAGE_HIDE.
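
For example, an analysis over per-GC records could group total GC time by the reason of the first slice (assuming each slice records its trigger reason, per the point above; exact field names may differ):

  function groupByFirstSliceReason(gcs) {
    let byReason = new Map();
    for (let gc of gcs) {
      let reason = gc.slices[0].reason;
      if (!byReason.has(reason))
        byReason.set(reason, []);
      byReason.get(reason).push(gc.total_time);
    }
    return byReason;
  }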

> ::: toolkit/components/telemetry/GCTelemetry.jsm
> @@ +54,5 @@
> > +    // practice.
> > +    for (let i = 0; i < this.randomlySelected.length; i++) {
> > +      let r = Math.random();
> > +      if (r <= prob) {
> > +        this.randomlySelected[i] = data;
> 
> This would always overwrite the random samples from the latest data.

I don't understand. Could you explain more? The purpose of this is to overwrite randomlySelected[i] with new data with probability |prob|.

> I don't know how the "garbage-collection-statistics" work - would that be a
> problem and not sample uniformly over the session (e.g. bias towards high
> activity cycles close to shutdown)?

I don't understand what you mean. "garbage-collection-statistics" is notified for every GC. If we GC a lot during shutdown, then there's a good chance we'll see those GCs in the data. That would be useful information to have.

> ::: toolkit/components/telemetry/docs/data/main-ping.rst
> @@ +387,5 @@
> > +      "random": [
> > +        {
> > +          // All times are in milliseconds. Absolute times are in
> > +          // milliseconds since the epoch.
> > +          "timestamp": 1476905859874063,
> 
> Is it useful to have absolute timestamps?
> Timestamps relative to session/browser start seem sufficient and easier to
> work with.

Yeah, that makes sense. Do you know how to determine browser startup time?

> @@ +411,5 @@
> > +              // GC state when the slice started
> > +              "initial_state": "NotActive",
> > +              // GC state when the slice ended
> > +              "final_state": "Mark",
> > +              "budget": "10ms",
> 
> Does this need to include a unit?
> Would it be easier to always have them in ms/ns/...?

This value can either be "unlimited", "work(x)", or some number of milliseconds. So it's kind of a tagged union.
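
A consumer would have to treat it as a tagged union, along these lines (helper name hypothetical):

  function parseBudget(budget) {
    if (budget == "unlimited")
      return {kind: "unlimited"};
    let work = /^work\((\d+)\)$/.exec(budget);
    if (work)
      return {kind: "work", units: Number(work[1])};
    let time = /^(\d+(?:\.\d+)?)ms$/.exec(budget);
    if (time)
      return {kind: "time", ms: Number(time[1])};
    return {kind: "unknown", raw: budget};
  }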

> ::: toolkit/components/telemetry/tests/browser/browser_TelemetryGC.js
> @@ +1,5 @@
> > +"use strict";
> > +
> > +const {GCTelemetry} = Cu.import("resource://gre/modules/GCTelemetry.jsm", {});
> > +
> > +add_task(function* test() {
> 
> Is there a reason why this is a browser test instead of xpcshell?

The "garbage-collection-statistics" observer is notified in nsJSEnvironment, which is only created for DOM windows. I don't think xpcshell tests have a way to create DOM windows.
(In reply to Bill McCloskey (:billm) from comment #12)
> > This looks like it will add quite a bit to the payload size, so we should
> > have a size estimate (per ping, raw size, pessimistic/upper bound) for our
> > budget estimates and tracking.
> 
> I'll try to come up with something. Perhaps it would be best to limit the
> size of the data.

If the number of slices is unbounded, we should definitely have upper limits on them.
In other places we discard when the limit is reached and put some kind of marker in to track discarding.

> > Note that the way this is implemented, the data will only be available in
> > "saved-session" pings and "main" pings sent on "shutdown", with the
> > resulting latency restrictions.
> 
> Understood. I'm mainly interested in data collected for the child process,
> and I didn't see any mechanism to send up non-histogram subsession data from
> the child.

Right. We are working to expand the common data types available (adding scalars & events besides histograms) and have those collected in the main process without intervention.

In the medium-term we want to get rid of child payloads and live-collect in the parent process.
For that, I would like to figure out how we can also migrate this in the future.

> > This looks like it could map well to keyed scalars (e.g. keyed scalars like
> > "gc.stats.max_pause" with keys like "random0", "worst0").
> > http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/
> > collection/scalars.html
> > That way, you could always directly record the statistics in Telemetry and
> > Telemetry would take care of sending it with the "main" pings whenever
> > needed.
> 
> I don't think that would work. It's very important to be able to see data
> for distinct GCs. That way you can see that, for example, GCs whose first
> slice was triggered by CC_WAITING took longer than those where the first
> slice reason was PAGE_HIDE.

I see, two levels are harder to represent.
I think this would be possible (e.g. encoding it in the keys like "gc.stats.max_pause":{"random/0/1":23.221}), but that's not exactly elegant.

> > ::: toolkit/components/telemetry/GCTelemetry.jsm
> I don't understand. Could you explain more? The purpose of this is to
> overwrite randomlySelected[i] with new data with probability |prob|.
...
> "garbage-collection-statistics" is
> notified for every GC. If we GC a lot during shutdown, then there's a good
> chance we'll see those GCs in the data. That would be useful information to
> have.

I was reading a bit too quickly. Makes sense, thanks.

> > ::: toolkit/components/telemetry/docs/data/main-ping.rst
> > @@ +387,5 @@
> > > +      "random": [
> > > +        {
> > > +          // All times are in milliseconds. Absolute times are in
> > > +          // milliseconds since the epoch.
> > > +          "timestamp": 1476905859874063,
> > 
> > Is it useful to have absolute timestamps?
> > Timestamps relative to session/browser start seem sufficient and easier to
> > work with.
> 
> Yeah, that makes sense. Do you know how to determine browser startup time?

We use nsITelemetry.msSinceProcessStart() / Telemetry::Common::MsSinceProcessStart() for this.
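
A sketch of the conversion (this is essentially the BASE_TIME computation the later patch uses):

  // Milliseconds since the epoch at which this process started.
  const BASE_TIME = Date.now() - Services.telemetry.msSinceProcessStart();

  function toRelativeMs(absoluteMs) {
    return absoluteMs - BASE_TIME;
  }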

> > @@ +411,5 @@
> > > +              // GC state when the slice started
> > > +              "initial_state": "NotActive",
> > > +              // GC state when the slice ended
> > > +              "final_state": "Mark",
> > > +              "budget": "10ms",
> > 
> > Does this need to include a unit?
> > Would it be easier to always have them in ms/ns/...?
> 
> This value can either be "unlimited", "work(x)", or some number of
> milliseconds. So it's kind of a tagged union.

I see. Can you add that to the docs (or are there existing ones)?
Comment on attachment 8802747 [details] [diff] [review]
Save GCs in telemetry

Review of attachment 8802747 [details] [diff] [review]:
-----------------------------------------------------------------

Note that this will also require an update to the ping schemas:
https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/master/telemetry/main.schema.json
https://github.com/mozilla-services/mozilla-pipeline-schemas/tree/master/validation/telemetry

::: toolkit/components/telemetry/GCTelemetry.jsm
@@ +19,5 @@
> +
> +const Cu = Components.utils;
> +
> +Cu.import("resource://gre/modules/Services.jsm", this);
> +Cu.import("resource://gre/modules/Preferences.jsm", this);

This is unused.

@@ +106,5 @@
> +
> +  entries() {
> +    return {
> +      random: GCData.randomlySelected,
> +      worst: GCData.worst,

Can we filter out `null` entries here?

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +1288,5 @@
>        payloadObj.chromeHangs = protect(() => Telemetry.chromeHangs);
>        payloadObj.threadHangStats = protect(() => this.getThreadHangStats(Telemetry.threadHangStats));
>        payloadObj.log = protect(() => TelemetryLog.entries());
>        payloadObj.webrtc = protect(() => Telemetry.webrtcStats);
> +      payloadObj.gc = protect(() => GCTelemetry.entries());

This adds the GC data to all pre-release "main" pings, while only pings with "reason":"shutdown" contain both parent & child data.
If you only intend to look at the "saved-session" or "shutdown" pings, let's make this property optional and only set it when `reason` is one of `REASON_SAVED_SESSION` or `REASON_SHUTDOWN`.

@@ +1773,5 @@
>      }
>      if (AppConstants.platform === "android") {
>        Services.obs.removeObserver(this, "application-background", false);
>      }
> +    if (Telemetry.canRecordExtended) {

.canRecordExtended can change state over the browser session.
It might be easier to always shut down GCTelemetry here? Slightly wasteful though.

::: toolkit/components/telemetry/docs/data/main-ping.rst
@@ +383,5 @@
> +
> +.. code-block:: js
> +
> +    "gc": {
> +      "random": [

Is the number of keys in this and the individual slices bounded?
If not, let's hard-limit them and document that.

@@ +401,5 @@
> +          "nonincremental_reason": "none",
> +          "allocated": 37, // in megabytes
> +          "added_chunks": 54,
> +          "removed_chunks": 12,
> +          "slices": [

We should have a documented hard limit on the number of slices.

@@ +403,5 @@
> +          "added_chunks": 54,
> +          "removed_chunks": 12,
> +          "slices": [
> +            {
> +              "slice": 0,  // the index of this slice

Is this the same as the index of the slice in the array?

@@ +418,5 @@
> +              // absolute start/end times
> +              "start_timestamp": 1476905858444261,
> +              "end_timestamp": 1476905858467482,
> +              // time taken by each phase
> +              "times": {

Is the number of keys in here bounded?

::: toolkit/components/telemetry/tests/browser/browser_TelemetryGC.js
@@ +8,5 @@
> +  yield new Promise(resolve => SpecialPowers.pushPrefEnv(prefs, resolve));
> +
> +  let shutdown = GCTelemetry.init();
> +  registerCleanupFunction(() => {
> +    if (shutdown) {

Nit: GCTelemetry.shutdown() gracefully handles this, so this doesn't seem needed.

@@ +36,5 @@
> +  }
> +
> +  for (let k of Object.keys(entries)) {
> +    ok(FIELDS.includes(k), `${k} found in FIELDS`);
> +  }

Also check that `entries` only contains these fields.

@@ +51,5 @@
> +      }
> +
> +      foundGCs++;
> +
> +      // Sanity check the GC data.

Is it possible to check all the fields, required and optional?
Can we sanity-check the number of keys < $some_limit?

@@ +55,5 @@
> +      // Sanity check the GC data.
> +      ok("total_time" in gc, "total_time field present");
> +      ok("max_pause" in gc, "max_pause field present");
> +      ok("slices" in gc, "slices field present");
> +      ok(Array.isArray(gc.slices), "slices is an array");

We should check for the size limit here.

@@ +58,5 @@
> +      ok("slices" in gc, "slices field present");
> +      ok(Array.isArray(gc.slices), "slices is an array");
> +      ok(gc.slices.length > 0, "slices array non-empty");
> +
> +      for (let slice of gc.slices) {

Ditto:
Is it possible to check all the fields, required and optional?
Can we sanity-check the number of keys < $some_limit?
Attachment #8802747 - Flags: review?(gfritzsche)
(In reply to Georg Fritzsche [:gfritzsche] [away Oct 21 - 24] from comment #13)
> > > Note that the way this is implemented, the data will only be available in
> > > "saved-session" pings and "main" pings sent on "shutdown", with the
> > > resulting latency restrictions.
> > 
> > Understood. I'm mainly interested in data collected for the child process,
> > and I didn't see any mechanism to send up non-histogram subsession data from
> > the child.
> 
> Right. We are working to expand on the common data types available (besides
> histograms adding scalars & events) and have those collected in the main
> process without intervention.
> 
> In the medium-term we want to get rid of child payloads and live-collect in
> the parent process.
> For that, I would like to figure out how we can also migrate this in the
> future.
> 
> > > This looks like it could map well to keyed scalars (e.g. keyed scalars like
> > > "gc.stats.max_pause" with keys like "random0", "worst0").
> > > http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/
> > > collection/scalars.html
> > > That way, you could always directly record the statistics in Telemetry and
> > > Telemetry would take care of sending it with the "main" pings whenever
> > > needed.
> > 
> > I don't think that would work. It's very important to be able to see data
> > for distinct GCs. That way you can see that, for example, GCs whose first
> > slice was triggered by CC_WAITING took longer than those where the first
> > slice reason was PAGE_HIDE.
> 
> I see, two levels are harder to represent.
> I think this would be possible (e.g. encoding it in the keys like
> "gc.stats.max_pause":{"random/0/1":23.221}), but that's not exactly elegant.

Chris, what do you think about this (and possibly similar, more extensive, performance data)?
I don't have a good idea off-hand about handling this or what similar patterns we have.

Attachment 8802747 [details] [diff] has a main-ping.rst update with the shape of the data, for easier parsing.
Flags: needinfo?(chutten)
Attached patch fix totals arraySplinter Review
This bug was causing the "totals" section of the JSON output to be all zero. We need to wait until after the slice callback to reset the totals.
Attachment #8803214 - Flags: review?(sphink)
This callback here is what we use to paint during tab switch. We don't want to count that time against the GC. Are these the right functions to use?
Attachment #8803215 - Flags: review?(sphink)
Attached patch Save GCs in telemetry, v2 (obsolete) — Splinter Review
This new version addresses the review comments. I also converted the code so that child process data is sent up to the parent within GCTelemetry.jsm. That allowed me to reset the data for each subsession. I think this will be valuable since people with long sessions often have the worst GC times.
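In broad strokes, the wiring looks like this (the message name matches the patch; the record() helper is illustrative):

  // Child process: forward each GC record to the parent.
  Services.cpmm.sendAsyncMessage("Telemetry:GC", data);

  // Parent process: accumulate records from all content processes.
  Services.ppmm.addMessageListener("Telemetry:GC", msg => GCData.record(msg.data));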
Attachment #8802747 - Attachment is obsolete: true
Attachment #8803217 - Flags: review?(gfritzsche)
*reads the wall o' text*

This concept seems so specialized... it seems as though it really only wants Telemetry for its network connection and reliable delivery. Which is fine... but r- from me if you don't have the child->parent aggregation thought out completely :)

I think that this isn't a problem we can or should solve with our general-purpose telemetry mechanics. It requires broad swathes of precise data, tightly coupled. So not histograms, and not scalars. And little of what we do to satisfy those requirements will be reusable, because it's so specialized.

Which means a bunch of specialized handling... which we were already going to have to do if we tried to encode the data awkwardly as hgrams or scalars.

tl;dr - With what I can glean from reading this bug, I think the original idea of "JSON blob with custom Spark handling" is unavoidable.
Flags: needinfo?(chutten)
Comment on attachment 8803214 [details] [diff] [review]
fix totals array

Review of attachment 8803214 [details] [diff] [review]:
-----------------------------------------------------------------

Oh, right!
Attachment #8803214 - Flags: review?(sphink) → review+
Comment on attachment 8803217 [details] [diff] [review]
Save GCs in telemetry, v2

Review of attachment 8803217 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/GCTelemetry.jsm
@@ +32,5 @@
> +// Names of processes where we record GCs.
> +const PROCESS_NAMES = ["main", "content"];
> +
> +// Should be the time we started up in milliseconds since the epoch.
> +const BASE_TIME = Date.now() - Services.telemetry.msSinceProcessStart();

Note that Date.now() is not monotonic - might unstable clocks (jumping forward/backward) affect the calculation below?
Would it make sense to already record the timestamps as relative in the GC code?

@@ +109,5 @@
> +const MAX_SLICES = 15;
> +
> +function limitSize(data) {
> +  // Store the number of slices so we know if we lost any at the end.
> +  data.num_slices = data.slices.length;

Can we also cap the phase key count at some sane upper limit (e.g. 100 or 200, doesn't have to be the expected 65 exactly)?
This would protect us from very high payload sizes (and subsequently pings getting discarded etc.) in case of bugs or unexpected behavior.
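
A sketch of that cap layered on limitSize (the 100-key bound follows this suggestion; which slices and phases the landed patch keeps may differ):

  const MAX_SLICES = 15;
  const MAX_PHASES = 100;

  function limitSize(data) {
    // Store the number of slices so we know if we lost any at the end.
    data.num_slices = data.slices.length;
    data.slices.length = Math.min(data.slices.length, MAX_SLICES);

    for (let slice of data.slices) {
      // Drop any phase keys beyond the cap.
      for (let key of Object.keys(slice.times).slice(MAX_PHASES))
        delete slice.times[key];
    }
  }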

@@ +156,5 @@
> +    Services.obs.removeObserver(this, "garbage-collection-statistics");
> +    Services.obs.removeObserver(this, "internal-telemetry-after-subsession-split");
> +
> +    if (Services.appinfo.processType == Services.appinfo.PROCESS_TYPE_DEFAULT) {
> +      Services.ppmm.removeMessageListener("Telemetry:GC", this);

Can this be more descriptive, e.g. "Telemetry:GCStatistics"?

@@ +162,5 @@
> +    this.initialized = false;
> +  },
> +
> +  observe(subject, topic, arg) {
> +    if (topic == "internal-telemetry-after-subsession-split") {

I think this is not needed.
Instead, let's make `entries()` reset (always or optionally).
Then TelemetrySession can collect & reset the GC data.

Note that with the clearing out, now only the "main" pings will be representative for analysis.
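
For example (the `clear` parameter is the suggestion here, not necessarily the landed signature; this also folds in the null-filtering requested earlier):

  entries(clear = false) {
    let result = {
      random: GCData.randomlySelected.filter(d => d !== null),
      worst: GCData.worst.filter(d => d !== null),
    };
    if (clear) {
      GCData.randomlySelected.fill(null);
      GCData.worst.fill(null);
    }
    return result;
  },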

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +1775,5 @@
>      }
>      if (AppConstants.platform === "android") {
>        Services.obs.removeObserver(this, "application-background", false);
>      }
> +    GCTelemetry.shutdown();

This changes the meaning of this function a bit.
Let's adjust the comment accordingly (that this is about shutdown operations common to all process types).

::: toolkit/components/telemetry/tests/browser/browser_TelemetryGC.js
@@ +31,5 @@
> +
> +      ok("slices" in gc, "slices field present");
> +      ok(Array.isArray(gc.slices), "slices is an array");
> +      ok(gc.slices.length > 0, "slices array non-empty");
> +      ok(gc.slices.length <= 20, "slices array is not too long");

This should be `<= 15`.

@@ +60,5 @@
> +            ok(typeof(slice[key]) != "object", `${key} property should be primitive`);
> +          }
> +        }
> +
> +        ok(Object.keys(slice.times).length < 64, "no more than 64 phases");

The doc mentions "<= 65 possible phases", here it is "< 64"?

@@ +95,5 @@
> +
> +  function initScript() {
> +    const {GCTelemetry} = Components.utils.import("resource://gre/modules/GCTelemetry.jsm", {});
> +
> +    /* Don't shut down GC telemetry if it was already running before the test! */

Nit: Use "//".
Attachment #8803217 - Flags: review?(gfritzsche) → feedback+
> Note that Date.now() is not monotonic - might unstable clocks (jumping
> forward/backward) affect the calculation below?
> Would it make sense to already record the timestamps as relative in the GC
> code?

That probably makes sense. However, all the GC timing infrastructure is based on PRMJ_Now, which is non-monotonic. Changing that would be a large amount of work and I think it's out of scope for this bug. I filed bug 1312908 for that.

Otherwise I think I've addressed all your comments. I don't think there's much I can do about only being able to use "main" pings. I guess we could keep two copies of all the GC data (session and subsession). That seems like overkill though.
Attachment #8803217 - Attachment is obsolete: true
Attachment #8804512 - Flags: review?(gfritzsche)
(In reply to Bill McCloskey (:billm) from comment #22)
> Otherwise I think I've addressed all your comments. I don't think there's
> much I can do about only being able to use "main" pings. I guess we could
> keep two copies of all the GC data (session and subsession). That seems like
> overkill though.

Right, using "main" pings is what we prefer anyway (saved-session will go away in the future), i just wanted to call it out for the analysis work.
Comment on attachment 8804512 [details] [diff] [review]
Save GCs in telemetry, v3

Review of attachment 8804512 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, thanks!

::: toolkit/components/telemetry/tests/browser/browser_TelemetryGC.js
@@ +129,5 @@
> +
> +  let localPromise = new Promise(resolve => {
> +    function obs() {
> +      Services.obs.removeObserver(obs, "garbage-collection-statistics");
> +      dump("local resolved\n");

Left-over?
Attachment #8804512 - Flags: review?(gfritzsche) → review+
As next steps we would need:
* an upper-bound/pessimistic storage estimation (raw JSON text size) for mreid
* after landing, updating the ping schemas with `gc` as an optional property (r?mreid):
  - https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/master/telemetry/main.schema.json
  - https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/master/validation/telemetry/sample_v4_ping.json
Comment on attachment 8803215 [details] [diff] [review]
don't count painting time in GC stats

Review of attachment 8803215 [details] [diff] [review]:
-----------------------------------------------------------------

Heh. That's not quite what those functions were added for, in fact in a way they're the opposite, but I think they're fine for your purpose.

Normally, suspendPhases() is called when you *want* to permit GC to happen -- e.g., in JSGC_END, you want to be able to do a whole new GC, but phases validate that they only appear in one place in the hierarchy. So if you didn't do anything, it would assert at the first GC phase, since that phase will be marked with PHASE_NO_PARENT. This has the side effect of not accumulating time during GCs within JSGC_END (non-GC code *is* counted, not that it matters here).

But the side effect is what you want, and it semantically makes sense to say that you're not spending time in GC while processing the GC interrupt, so this seems good. And losing the phase nesting assertion is fine, since you have other asserts that will catch the same thing.
Attachment #8803215 - Flags: review?(sphink) → review+
Attached patch Save GCs in telemetry, v4 (obsolete) — Splinter Review
Here's an updated patch that saves less data than before. I reduced the number of GCs saved and the number of slices saved per GC.
Attachment #8804985 - Flags: review+
Mark, this patch increases the size of our main telemetry ping. I tested it for about half an hour and it increased the size of the JSON by about 15%. That's probably a reasonable average estimate.

For a worst-case estimate based on the limits imposed in the patch, I get:
  (6 GCs saved per ping)
* [  (20 fields of per-GC overhead)*(20 bytes per field)
   + (65 phases in "totals")*(30 bytes per phase)
   + (4 slices per GC)
   * [  (10 fields of per-slice overhead)*(20 bytes per field)
      + (65 phases)*(30 bytes per phase)
     ]
   ]
= 6 * [ 20*20 + 65*30 + 4*[10*20 + 65*30] ]
= 65700 bytes

So the most that a ping should increase by is 65700 bytes. Is there any more information you need before I land this?
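
As a quick check of the arithmetic:

  const bound = 6 * (20 * 20 + 65 * 30 + 4 * (10 * 20 + 65 * 30));
  // bound == 65700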
Flags: needinfo?(mreid)
Note that this Telemetry addition also still needs data review from one of the peers listed here:
https://wiki.mozilla.org/Firefox/Data_Collection
Just to confirm, the new gc data is to be reported for extended telemetry only?

The worst-case estimate seems fairly heavy (that would be about a 50% increase), but not outlandish.

That amount of extra data should not cause any technical problems. In the worst case, it would amount to about an extra 1TB of uncompressed data per day. That should not have major budget implications - even uncompressed that's under $1000/month, so with compression it should be much lower.

Actually, it'd be about double that if it's going to be included in *both* main and saved-session pings. Do we need this data in both ping types?

I would also like to see a sketch of the data analysis and presentation/visualization, but that can be addressed in a separate bug.
Flags: needinfo?(mreid)
Comment on attachment 8804985 [details] [diff] [review]
Save GCs in telemetry, v4

Requesting data review. There are docs in the patch.
Attachment #8804985 - Flags: review?(francois)
(In reply to Mark Reid [:mreid] from comment #30)
> Just to confirm, the new gc data is to be reported for extended telemetry
> only?

Yes.

> Actually, it'd be about double that if it's going to be included in *both*
> main and saved-session pings. Do we need this data in both ping types?

No. The data is only recorded for main pings.
(In reply to Bill McCloskey (:billm) from comment #32)
> (In reply to Mark Reid [:mreid] from comment #30)
> > Just to confirm, the new gc data is to be reported for extended telemetry
> > only?
> 
> Yes.
> 
> > Actually, it'd be about double that if it's going to be included in *both*
> > main and saved-session pings. Do we need this data in both ping types?
> 
> No. The data is only recorded for main pings.

Sounds good on both counts.
Actually, I just realized that Benjamin is a data peer. I already talked over this change with him, but maybe it needs a closer review? Can you sign off on this, bsmedberg?
Flags: needinfo?(benjamin)
Benjamin said he would need to review the actual code.
Flags: needinfo?(benjamin)
Comment on attachment 8804985 [details] [diff] [review]
Save GCs in telemetry, v4

For data review. Please see the .rst documentation in the patch.
Attachment #8804985 - Flags: review?(francois) → review?(benjamin)
Rebased and with updated documentation (I changed a few of the constants and forgot to update the docs).
Attachment #8804985 - Attachment is obsolete: true
Attachment #8804985 - Flags: review?(benjamin)
Attachment #8806026 - Flags: review?(benjamin)
Comment on attachment 8806026 [details] [diff] [review]
Save GCs in telemetry, v5

>diff --git a/toolkit/components/telemetry/docs/data/main-ping.rst b/toolkit/components/telemetry/docs/data/main-ping.rst

>+    "gc": {
>+      "random": [
>+        {
>+          // Timestamps are in milliseconds since startup.

Please include somewhere a note about these being wall-clock times and not a monotonic clock. I think in general the telemetry docs assume that all intervals are monotonic unless otherwise documented.

>+          "nonincremental_reason": "none",

What are the possible values here? Can we document them?

Thank you for testing the telemetry! It makes me super happy to see that.
Attachment #8806026 - Flags: review?(benjamin) → review+
Pushed by wmccloskey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/59b82ce58741
Notify JSON if extended telemetry enabled (r=jonco,gfritzsche)
https://hg.mozilla.org/integration/mozilla-inbound/rev/150fb34c63a1
Don't output 0s in GC JSON (r=sfink)
https://hg.mozilla.org/integration/mozilla-inbound/rev/6964ddd8a364
Fix GC JSON totals array (r=sfink)
https://hg.mozilla.org/integration/mozilla-inbound/rev/54da3269c276
Don't count painting time in GC telemetry (r=sfink)
https://hg.mozilla.org/integration/mozilla-inbound/rev/1a44a71f1d22
Save GCs in telemetry (r=gfritzsche,bsmedberg)
This updates the JSON schema files. I wasn't sure how strict to be. I tried to encode all the size limitations that we have in the code and tests. I guess the danger here is that we'll be too strict and drop the entire ping (I assume that's what this schema is used for). Hopefully the restrictions in the client code will prevent that. Do we have any mechanism to look at the rate at which pings are dropped to make sure there isn't a problem?
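
Roughly, the new property looks like this fragment (array limits shown are placeholders standing in for the client-side caps; "worst" would be analogous to "random"):

  "gc": {
    "type": "object",
    "properties": {
      "random": {
        "type": "array",
        "maxItems": 2,
        "items": {
          "type": "object",
          "properties": {
            "slices": {"type": "array", "maxItems": 4}
          }
        }
      }
    }
  }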
Attachment #8806116 - Flags: review?(mreid)
Comment on attachment 8806116 [details] [diff] [review]
JSON schema change

Looks good, please submit a PR to https://github.com/mozilla-services/mozilla-pipeline-schemas
Attachment #8806116 - Flags: review?(mreid) → review+
FYI, so far it appears that submissions from builds on/after this change are on the order of 20 KB larger in raw size.

See https://metrics.services.mozilla.com/telemetry-budget-dashboard/ for details.
Depends on: 1330901
Mark, you reviewed a patch to the schema here. Did it land? Do you want it?
Flags: needinfo?(mreid)
It did not land, and we do want it! I've created a PR to the schemas repo above:

https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/130
Flags: needinfo?(mreid)
\o/ OK! Looks like we can close this, then.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED