Default histogram recording to main & content process

RESOLVED FIXED in Firefox 56

Status

()

defect
P1
normal
RESOLVED FIXED
3 years ago
2 years ago

People

(Reporter: gfritzsche, Assigned: chutten)

Tracking

Trunk
mozilla56
Points:
2
Dependency tree / graph

Firefox Tracking Flags

(firefox54 wontfix, firefox55 fix-optional, firefox56 fixed)

Details

(Whiteboard: [measurement:client])

Attachments

(2 attachments, 3 obsolete attachments)

Similar to bug 1333797 for scalars, we should limit the recorded histograms to specific process types before the GPU process starts to roll out to broader populations [1].

Proposed plan:
- add a required field `record_in_processes` to Histograms.json
- for all existing histograms, fill in `record_in_processes: ["main", "content"]`
- reach out to see which histograms need GPU process recording (:dvander, :ahughes?)
- make sure the new field doesn't surprise/break any data pipeline jobs

1: https://ashughes.com/?p=374
Priority: P2 → P1
The following non-flag histograms were present in payload.processes.gpu amongst Nightly pings submitted on April 18:

CHECKERBOARD_DURATION
CHECKERBOARD_PEAK
CHECKERBOARD_POTENTIAL_DURATION
CHECKERBOARD_SEVERITY
COMPOSITE_FRAME_ROUNDTRIP_TIME
COMPOSITE_TIME
COMPOSITOR_ANIMATION_DURATION
COMPOSITOR_ANIMATION_MAX_CONTIGUOUS_DROPS_APZ
COMPOSITOR_ANIMATION_MAX_CONTIGUOUS_DROPS_CHROME
COMPOSITOR_ANIMATION_MAX_CONTIGUOUS_DROPS_CONTENT
COMPOSITOR_ANIMATION_MAX_LAYER_AREA
COMPOSITOR_ANIMATION_THROUGHPUT_APZ
COMPOSITOR_ANIMATION_THROUGHPUT_CHROME
COMPOSITOR_ANIMATION_THROUGHPUT_CONTENT
CONTENT_RESPONSE_DURATION
D3D11_SYNC_HANDLE_FAILURE
DEVICE_RESET_REASON
FORCED_DEVICE_RESET_REASON
GPU_PROCESS_INITIALIZATION_TIME_MS
MEDIA_DECODER_BACKEND_USED
MEDIA_WMF_DECODE_ERROR
SCROLL_INPUT_METHODS

I expect this is a superset of hgrams we care about having from GPU processes, but we can start here.

(( As for flag histograms, there were none submitted on that day that had a non-false value. I'm thinking we successfully kept them from infecting gpu. ))

Hrm. I wonder if there's any code that expects flag histograms to have a valid value on every process or if I'll be able to cleanly skip all of them when reflecting to JS...
Assignee: nobody → chutten
Status: NEW → ASSIGNED
Almost forgot gpu's keyed histograms:

[u'IPC_READ_MAIN_THREAD_LATENCY_MS',
 u'IPC_WRITE_LATENCY_MS',
 u'D3D11_COMPOSITING_FAILURE_ID',
 u'OPENGL_COMPOSITING_FAILURE_ID',
 u'CANVAS_WEBGL_ACCL_FAILURE_ID',
 u'IPC_MESSAGE_SIZE',
 u'IPC_READ_LATENCY_MS',
 u'IPC_WRITE_MAIN_THREAD_LATENCY_MS',
 u'MAIN_THREAD_RUNNABLE_MS']
IPC_{READ,WRITE}_LATENCY_MS are vestigial from bug 1342635 and can be ignored.
Comment on attachment 8866031 [details]
bug 1335343 - Add initial record_in_processes support

https://reviewboard.mozilla.org/r/137624/#review140994

This looks good to me, thanks. Great job in adding all the entries to the Histograms.json file. I'm holding back the r+ until the issue below is cleared.

::: toolkit/components/telemetry/histogram_tools.py:168
(Diff revision 1)
>  
>      def dataset(self):
>          """Returns the dataset this histogram belongs into."""
>          return self._dataset
>  
>      def labels(self):

I think you also need to add getters here. See the scalars implementation [here](http://searchfox.org/mozilla-central/rev/7057a51c5cf29b5b115b1db19ace2cfe3fd83a0e/toolkit/components/telemetry/parse_scalars.py#241,246). You probably just need to add one to get the enum needed for the implementation.

Are doc updates coming as a separate patch?
Comment on attachment 8866031 [details]
bug 1335343 - Add initial record_in_processes support

https://reviewboard.mozilla.org/r/137624/#review141086

This looks good now, thanks! Let's not forget to update the docs later!
Attachment #8866031 - Flags: review?(alessio.placitelli) → review+
Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/69ddf4e384ab
Add initial record_in_processes support r=Dexter
Ack. I knew I forgot something.
Flags: needinfo?(chutten)
Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ed073a4c343e
Add initial record_in_processes support r=Dexter
Can you send an update to fhr-dev for this?
You could follow up on the thread: "Making the process choice required for Histograms.json".
Flags: needinfo?(chutten)
Flags: needinfo?(chutten)
Keywords: leave-open
I missed that we may have to update data job dependencies for this.
I worry whether this breaks python_moztelemetry or the aggregator (see bug 1343855, comment 8 ff).
Non-blocking, but this broke the (non-critical) probe-scraper [1].

1: https://nbviewer.jupyter.org/url/s3-us-west-2.amazonaws.com/telemetry-public-analysis-2/gfritzsche-telemetry-probe-scraper/data/load_and_run.ipynb
We should just ignore/allow any additional properties when `self._strict_type_checks` is not set (meaning the script is used in data jobs).
That way we won't run into this repeatedly.
Depends on: 1363973
Depends on: 1363934
Blocks: 1364362
Depends on: 1364207
Depends on: 1364393
No longer depends on: 1363973
Here's the enforcement patch, with tests.
Attachment #8869157 - Flags: review?(alessio.placitelli)
Comment on attachment 8869157 [details] [diff] [review]
0001-bug-1335343-Prevent-histogram-recording-in-disallowe.patch

Review of attachment 8869157 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks Chris, my comments mostly relate to testing:

- we could make the test probes more straightforward and cover the rest of the processes flags;
- we could test it in test_ChildHistograms, since we have a content process there.

::: toolkit/components/telemetry/Histograms.json
@@ +7561,4 @@
>      "kind": "flag",
>      "description": "a testing histogram; not meant to be touched"
>    },
> +  "TELEMETRY_TEST_NOT_MAIN": {

nit: what about renaming this to TELEMETRY_TEST_CONTENT_PROCESS?

@@ +7571,5 @@
> +    "n_buckets": 10,
> +    "bug_numbers": [1335343],
> +    "description": "a testing histogram; not meant to be touched"
> +  },
> +  "TELEMETRY_TEST_KEYED_NOT_MAIN": {

nit: TELEMETRY_TEST_KEYED_CONTENT_PROCESS

@@ +7583,5 @@
> +    "keyed": true,
> +    "bug_numbers": [1335343],
> +    "description": "a testing histogram; not meant to be touched"
> +  },
> +  "TELEMETRY_TEST_FLAG_NOT_MAIN": {

nit: TELEMETRY_TEST_FLAG_CONTENT_PROCESS

@@ +7597,5 @@
> +    "alert_emails": ["telemetry-client-dev@mozilla.com"],
> +    "expires_in_version": "never",
> +    "kind": "flag",
> +    "bug_numbers": [1335343],
> +    "description": "a testing histogram; not meant to be touched"

Can we also test 'all' and 'all_childs'?

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +131,4 @@
>    uint32_t dataset;
>    uint32_t label_index;
>    uint32_t label_count;
> +  mozilla::Telemetry::Common::RecordedProcessType record_in_processes;

nit: we can add an 'using' directive for RecordedProcessType around line ~40 to make this more readable

@@ +1955,5 @@
>      return;
>    }
>  
> +  const HistogramInfo& h = gHistograms[aID];
> +  if (!CanRecordInProcess(h.record_in_processes, XRE_GetProcessType())) {

nit: can you drop a comment here to mention that we do that to prevent enabling recording from the wrong processes?

@@ +1984,2 @@
>    Histogram *h;
> +  rv = internal_GetHistogramByName(id, &h);

This ends up calling |internal_GetHistogramEnumId| again :( I don't think that would bite performances, as SetHistogramRecordingEnabled shouldn't be called on an hot-path. Can you confirm (or refactor this part to call internal_GetHistogramEnumId once)?

::: toolkit/components/telemetry/histogram-whitelists.json
@@ +2020,4 @@
>      "TELEMETRY_TEST_COUNT_INIT_NO_RECORD",
>      "TELEMETRY_TEST_EXPIRED",
>      "TELEMETRY_TEST_FLAG",
> +    "TELEMETRY_TEST_FLAG_NOT_MAIN",

Can we use some other histogram type so we don't have to add exceptions to the whitelist?

::: toolkit/components/telemetry/tests/unit/test_TelemetryHistograms.js
@@ +939,5 @@
>    subsession = h.subsessionSnapshot();
>    Assert.ok(!(KEY in subsession));
>    Assert.equal(h.subsessionSnapshot(KEY).sum, 0);
> +},
> +function* test_record_in_processes() {

Can you put this in a separate add_task()?

@@ +964,5 @@
> +
> +  // Flag histogram disallowed in content still works in main.
> +  content.add(1);
> +  Assert.ok("TELEMETRY_TEST_FLAG_NOT_CONTENT" in Telemetry.histogramSnapshots);
> +  Assert.equal(content.snapshot().sum, 1);

All you can test here is that accumulations from the main process don't make it into "content" histograms.

Wouldn't test_ChildHistogram.js be a more suitable place for this? That way you could check that accumulations from content processes don't make it into main processes histograms, etc.
Attachment #8869157 - Flags: review?(alessio.placitelli)
I also fleshed out the commit message to be clear on a couple of design necessities.
Attachment #8869157 - Attachment is obsolete: true
Attachment #8869570 - Flags: review?(alessio.placitelli)
Comment on attachment 8869570 [details] [diff] [review]
0001-bug-1335343-Prevent-histogram-recording-in-disallowe.patch

Review of attachment 8869570 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good, thanks. Other than the nit and the test request, I'm just holding back the r+ to understand what's the assumption you're referring to in test_ChildHistograms.js.

Keyed histograms should not be present if empty, regardless of the process. Am I missing something there?

Please also make sure this merges correctly (I'm looking at you TelemetryHistogram.cpp) due to bug 1361661 landing at the same time.

::: toolkit/components/telemetry/tests/unit/test_ChildHistograms.js
@@ +77,4 @@
>                 "Keyed flag test histogram should have the right value.");
>    Assert.equal(kh["TELEMETRY_TEST_KEYED_FLAG"]["b"].sum, 1,
>                 "Keyed flag test histogram should have the right value.");
> +

nit: extra blank line

@@ +135,5 @@
>  
> +  // Check record_in_processes
> +  // Content Process
> +  let hs = payload.processes.content.histograms;
> +  let khs = payload.processes.content.keyedHistograms;

Can we also check if the stored values match with the expected values that we set early in the test? For both the content and the parent process

@@ +148,5 @@
> +  let mainHs = payload.histograms;
> +  let mainKhs = payload.keyedHistograms;
> +  Assert.ok(!("TELEMETRY_TEST_CONTENT_PROCESS" in mainHs), "Should not have content process histogram in main process payload");
> +  //Assert.ok(!("TELEMETRY_TEST_KEYED_CONTENT_PROCESS" in mainKhs), "Should not have keyed content process histogram in main process payload");
> +  // Unfortunately, keyed histograms must be present due to assumptions in TelemetrySession.getKeyedHistograms

Which assumption? What's the problem here? We should not be reporting any empty keyed histogram, regardless of the process. Is this an issue here?
Attachment #8869570 - Flags: review?(alessio.placitelli) → feedback+
You are of course correct. I thought because of the TelemetrySession assumptions that underlies init code in TelemetryHistogram[1] which gives us the assumption in TelemetrySession that any registered keyedHistogram exists[2] it meant that they existed in the payload.

Here's the fix, including a fix to an issue the test uncovered where adding to a keyed histogram you aren't allowed to record into would populate the key, only failing to put the value into the histogram. (Which, of course, only happens in the parent process, because of course it wasn't evenly prohibitive)

Telemetry tests and lint pass. Ready for another look. (Adding it to :gfritzsche's queue as well due to a holiday).

[1]: http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/TelemetryHistogram.cpp#1854
[2]: http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/TelemetrySession.jsm#954
Attachment #8869570 - Attachment is obsolete: true
Attachment #8873584 - Flags: review?(gfritzsche)
Attachment #8873584 - Flags: review?(alessio.placitelli)
Attachment #8873584 - Flags: review?(gfritzsche)
Comment on attachment 8873584 [details] [diff] [review]
0001-bug-1335343-Prevent-histogram-recording-in-disallowe.patch

Review of attachment 8873584 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good now, thanks Chris.

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +2244,5 @@
>      mozilla::Telemetry::HistogramID id = mozilla::Telemetry::HistogramID(i);
>  
>      for (uint32_t process = 0; process < static_cast<uint32_t>(ProcessID::Count); ++process) {
> +      if (!CanRecordInProcess(hi.record_in_processes, ProcessID(process)) ||
> +        ((ProcessID(process) == ProcessID::Gpu) && !includeGPUProcess)) {

nit: indentation - I think this needs to be aligned to the "!" on the line above
Attachment #8873584 - Flags: review?(alessio.placitelli) → review+
Note: given the "Nightly" freeze thing, I'm not sure if this is allowed to land before the uplift. Do you mind checking this out?
Flags: needinfo?(chutten)
(In reply to Alessio Placitelli [:Dexter] from comment #23)
> Comment on attachment 8873584 [details] [diff] [review]
> nit: indentation - I think this needs to be aligned to the "!" on the line
> above

The multi-line conditional just below that begs to differ :)

(In reply to Alessio Placitelli [:Dexter] from comment #24)
> Note: given the "Nightly" freeze thing, I'm not sure if this is allowed to
> land before the uplift. Do you mind checking this out?

Yeah... I'm not the happiest about leaving this to bitrot for a week, but I guess that's the world we live in now. The criteria pretty clearly ask us to err on the side of caution with something like this.
Flags: needinfo?(chutten)
Comment on attachment 8873584 [details] [diff] [review]
0001-bug-1335343-Prevent-histogram-recording-in-disallowe.patch

Review of attachment 8873584 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/histogram_tools.py
@@ +120,4 @@
>          self._keyed = definition.get('keyed', False)
>          self._expiration = definition.get('expires_in_version')
>          self._labels = definition.get('labels', [])
> +        self._record_in_processes = definition['record_in_processes']

Sorry about this late request, I didn't catch it last time and I had a second look at it due to the failures we're seeing in bug 1368713.

This will throw a KeyError if 'record_in_processes' is not in definition. Maybe the safest path here is to use definition.get('record_in_processes', ['main'] if not self._is_use_counter else ['main', 'content']).

If we do that, I think we could even get rid of the 'use_counters' special case a few lines above.

In any case, we should be updating the docs to mention the defaults here: https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/collection/histograms.html#record-in-processes
It should probably be ok to default to 'main' and 'content'
Luckily, we didn't land this yet :)

Chris, would you mind fixing the stuff from comment 26 before landing this second part? Rajiit found about it in bug 1368713 since the tests we just added were failing... so I guess we're on the right path with that patch \o/
Flags: needinfo?(chutten)
I don't want there to be defaults for this key. It's required.

So I'll go with get('record_in_processes', None) and then it'll fail its type checks on the client yet succeed on the server and for use counters.
Flags: needinfo?(chutten)
I also rebased it atop central while I was looking at it.
Attachment #8873584 - Attachment is obsolete: true
Attachment #8877157 - Flags: review?(alessio.placitelli)
Attachment #8877157 - Flags: review?(alessio.placitelli) → review+
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8194cd1a678d
Prevent histogram recording in disallowed processes. r=Dexter
Keywords: checkin-needed
Blocks: 1368713
https://hg.mozilla.org/mozilla-central/rev/8194cd1a678d
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Depends on: 1373083
Mark 54 won't fix as 54 was released.
Duplicate of this bug: 1281791
You need to log in before you can comment on or make changes to this bug.