Closed Bug 1297790 Opened 3 years ago Closed 3 years ago

Telemetry for GPU Process Behavior

Categories

(Core :: Graphics, defect)

defect
Not set

Tracking

()

RESOLVED FIXED

People

(Reporter: dvander, Assigned: gw280)

References

Details

(Whiteboard: [gfx-noted])

Attachments

(1 file, 7 obsolete files)

We should have some very basic metrics for the GPU process:

(1) How long it takes to start up, from initial launch to the "GPUReady" ACK.
(2) How many times we create the GPU Process (right now, just one)
(3) How many times it crashes/dies, via an unexpected GPUChild::ActorDestroy.

There may be more specific things we want to know inside the GPU process, but nothing comes to mind yet.
And... (4) An entry in the telemetry environment that contains whether the GPU process is available/enabled/disabled on the platform. We do this for D3D11/D3D9/OGL etc by getting the gfxConfig feature state. That's probably what we want here too.
Whiteboard: [gfx-noted]
Assignee: nobody → gwright
Attachment #8804847 - Flags: feedback?(benjamin)
Attachment #8804848 - Flags: feedback?(benjamin)
Comment on attachment 8804848 [details] [diff] [review]
0002-Bug-1297790-Add-telemetry-probes-for-GPU-process-lau.patch

Review of attachment 8804848 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Histograms.json
@@ +8634,4 @@
>      "releaseChannelCollection": "opt-out",
>      "description": "Counts of plugin and content process crashes which are reported with a crash dump."
>    },
> +  "SUBPROCESS_LAUNCH_COUNT": {

Do we really need a keyed histogram here? Unless someone else has requested this data for another process type, I'd just name it GPU_PROCESS_LAUNCH_COUNT.
Fair comment. I figured I'd follow the existing pattern laid by SUBPROCESS_ABNORMAL_ABORT, but I'm not opposed to just having GPU_PROCESS_LAUNCH_COUNT either.
Comment on attachment 8804847 [details] [diff] [review]
0001-Bug-1297790-Add-GPU-process-feature-state-to-the-Tel.patch

Review of attachment 8804847 [details] [diff] [review]:
-----------------------------------------------------------------

It would be better to use the built-in feature reporting functionality, for example:

http://searchfox.org/mozilla-central/rev/8cf1367dd89cc36ef8f025dfc6af6d5c086838a7/widget/windows/GfxInfo.cpp#1399

InitFeatureObject takes in a name and status and automatically adds some information to Telemetry, including some strings about why stuff failed. You might have to change it so that the nsIGfxInfo part is optional - there's no nsIGfxInfo blocklisting for the GPU process (yet).
Attachment #8804847 - Flags: review?(dvander)
Attachment #8804847 - Flags: feedback?(benjamin)
Aha, I didn't notice http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/TelemetryEnvironment.jsm#1070 before; I thought the gfxInfo feature stuff only affected other things like crash reports.

Mostly just making this comment so that if anybody else stumbles on it they know about this.
Came up in meeting: We should also have telemetry for when the GPU process fails to get a D3D11 device, since this could imply we're losing acceleration.
We had a separate request to record the number of content processes we launched, so that we could end up with statistics like total number of content processes that terminated "normally" versus abnormally.

If SUBPROCESS_LAUNCH_COUNT is correct for all process types, that's great. But if we're only recording it for the GPU process, then we might as well record it as GPU_PROCESS_LAUNCH_COUNT

Related: if you record this as a performance histogram GPU_PROCESS_LAUNCH_MS, you get a performance measure *and* a count in one metric "for free". That's something we're going to retrofit for content processes in bug 1304790 so it might be worth doing now.
Oh well I didn't read the patch before the comments...

With GPU_PROCESS_LAUNCH_TIME_MS you don't need the other one at all. And I'd really encourage you to make this an opt-out metric and monitor it for release users as well, not just prerelease/opt-in users.
Attachment #8804848 - Flags: feedback?(benjamin)
Uses the InitFeatureObject method instead.
Attachment #8804847 - Attachment is obsolete: true
Attachment #8805622 - Flags: review?(dvander)
Attachment #8805622 - Flags: feedback?(benjamin)
This is now an opt-out probe, and it only adds the GPU_PROCESS_LAUNCH_TIME_MS probe.

As for D3D11 stuff re. comment 8, we already have sizeable telemetry for various D3D11 failures, through probes like D3D11_COMPOSITING_FAILURE_ID. We can correlate those with the GPU process feature status to get this information.

(On a related note: D3D11_COMPOSITING_FAILURE_ID's alert email is currently set to Benoit Girard, who no longer works here. We should probably update that).
Attachment #8805625 - Flags: review?(dvander)
Attachment #8805625 - Flags: feedback?(benjamin)
Comment on attachment 8805625 [details] [diff] [review]
0002-Bug-1297790-Add-telemetry-probes-for-GPU-process-lau.patch

Review of attachment 8805625 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/ipc/GPUChild.cpp
@@ +91,5 @@
>      return true;
>    }
>  
> +  bool ignore;
> +  Telemetry::AccumulateTimeDelta(Telemetry::GPU_PROCESS_LAUNCH_TIME_MS, TimeStamp::ProcessCreation(ignore));

I don't think this is the time you want - we want the time from when we request async launch of the GPU process, here:

http://searchfox.org/mozilla-central/rev/445097654c1cc1098ee0171a29c439afe363a588/gfx/ipc/GPUProcessHost.cpp#41

Up to the time we acknowledge the GPU process is ready, here:

http://searchfox.org/mozilla-central/rev/445097654c1cc1098ee0171a29c439afe363a588/gfx/ipc/GPUChild.cpp#82

(arguably we could use the time from when the launch starts to when the main IPC channel is open - that could be fine too. though the above is a better estimate of how long the GPU process makes us wait to paint.)

::: toolkit/components/telemetry/Histograms.json
@@ +952,5 @@
>      "bug_numbers": [1255198],
>      "description": "This metric is recorded every time a navigator.geolocation.watchPosition() request gets allowed/fulfilled. A false value is recorded if the owner is not visible according to document.isVisible."
>    },
> +  "GPU_PROCESS_LAUNCH_TIME_MS" : {
> +    "alert_emails": ["george@mozilla.com", "dvander@mozilla.com"],

I actually don't have an alias I think, should be danderson@
Attachment #8805625 - Flags: review?(dvander) → review+
Attachment #8805622 - Flags: review?(dvander) → review+
Carrying forward r+ from David, this patch incorporates the change requested in comment 13, with the exception that we accumulate in both EnsureGPUReady() and RecvInitComplete(). If one is called before the other, the other won't execute due to the early return in both cases. On my machine, at least, RecvInitComplete() executed before EnsureGPUReady().
Attachment #8804848 - Attachment is obsolete: true
Attachment #8805625 - Attachment is obsolete: true
Attachment #8805625 - Flags: feedback?(benjamin)
Attachment #8805667 - Flags: review+
Pushed by gwright@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0a1486cf11b2
Add telemetry probes for GPU process launch/launch-time/abort r=dvander
Leave open for now until I can land the other patch (waiting on sign off for the data collection)
Keywords: leave-open
Pushed by philringnalda@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/076a7d60d508
Add telemetry probes for GPU process launch/launch-time/abort r=dvander
I had to reland it because those were your parent push's sins, not yours.
Flags: needinfo?(gwright)
Comment on attachment 8805622 [details] [diff] [review]
0001-Bug-1297790-Add-GPU-process-feature-status-to-the-Te.patch

Data review is done on the data docs, not the code itself. In this case this should be documented at https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/environment.html (http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/docs/data/environment.rst)

Does this value ever change mid-run? If so do we need to trigger an environment recheck?
Attachment #8805622 - Flags: feedback?(benjamin) → feedback-
We are anticipating that the GPU process can change status during a session (most likely scenario is when falling back to in-process if we detect it's not working properly), but the telemetry environment should refresh for each ping, which should cover that case.
Attachment #8805622 - Attachment is obsolete: true
Attachment #8806106 - Flags: review+
Attachment #8806106 - Flags: feedback?(benjamin)
The telemetry environment doesn't refresh and trigger a subsession split unless you tell it to. So if the value can change after startup, you need to explicitly notify the telemetry environment about changes.
OK, this now refreshes the environment and triggers a subsession split when the GPU process is aborted. I don't think we need to do anything for GPU process creation, as that always happens at startup (at least for now).
Attachment #8805667 - Attachment is obsolete: true
Attachment #8807748 - Flags: review+
Attachment #8807748 - Flags: feedback?(benjamin)
Comment on attachment 8806106 [details] [diff] [review]
add docs for feature status in telemetry environment

Will this block be present at all if the GPU process is *not* available? Or will it look like "gpu-process: {} ? Pls clarify. data-review=me

Mauro can you please help George file any followup bugs about getting this included in longitudinal and other datasets?
Flags: needinfo?(mdoglio)
Attachment #8806106 - Flags: feedback?(benjamin) → feedback+
Comment on attachment 8807748 [details] [diff] [review]
0001-Bug-1297790-Add-GPU-process-feature-status-to-the-Te.patch

Already did data review, but I'm not going to code-review this. If you want code review for the TelemetryEnvironment bits, please ask gfritzsche.
Attachment #8807748 - Flags: feedback?(benjamin)
Attachment #8807748 - Flags: review+ → review?(gfritzsche)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #25)
> Comment on attachment 8806106 [details] [diff] [review]
> Will this block be present at all if the GPU process is *not* available? Or
> will it look like "gpu-process: {} ? Pls clarify. data-review=me

It will look something like "gfx.features.gpu-process.status: unused" (or whatever other feature status it has, the full list of which is here: http://searchfox.org/mozilla-central/source/gfx/src/gfxTelemetry.h#15).
Comment on attachment 8807748 [details] [diff] [review]
0001-Bug-1297790-Add-GPU-process-feature-status-to-the-Te.patch

Review of attachment 8807748 [details] [diff] [review]:
-----------------------------------------------------------------

This also needs to update test_TelemetryEnvironment.js (see checkSystemSection()).

Do we plan to enable the gpu process for all users? If not we may want to make this an optional data field instead.

::: toolkit/components/telemetry/TelemetryEnvironment.jsm
@@ +1072,5 @@
>        gfxData.features = gfxInfo.getFeatures();
> +
> +      if (environmentChange) {
> +        let oldEnvironment = Cu.cloneInto(this._currentEnvironment, myScope);
> +        this._onEnvironmentChange("gfx-features-changed", oldEnvironment);

Lets do this outside of _updateGraphicsFeatures(), keeping with the existing code patterns like for `_onPrefChanged()` etc.

::: toolkit/components/telemetry/docs/data/environment.rst
@@ +190,4 @@
>                  status: <string>,
>                  version: <string>,      // Either "1.0" or "1.1".
>                },
> +              "gpu-process" { // Out-of-process compositing ("GPU process") feature

Please stick with existing style, using camelCase and adding the trailing ":".

@@ +190,5 @@
>                  status: <string>,
>                  version: <string>,      // Either "1.0" or "1.1".
>                },
> +              "gpu-process" { // Out-of-process compositing ("GPU process") feature
> +                status: <string>, // "Available" means currently in use

Please describe this in the docs, if needed in a separate paragraph and pointing/linking to the header:
> It will look something like "gfx.features.gpu-process.status: unused" (or
> whatever other feature status it has, the full list of which is here:
> http://searchfox.org/mozilla-central/source/gfx/src/gfxTelemetry.h#15).
Attachment #8807748 - Flags: review?(gfritzsche)
(In reply to Georg Fritzsche [:gfritzsche] from comment #28)
> Comment on attachment 8807748 [details] [diff] [review]
> 0001-Bug-1297790-Add-GPU-process-feature-status-to-the-Te.patch
> 
> Review of attachment 8807748 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This also needs to update test_TelemetryEnvironment.js (see
> checkSystemSection()).
> 
> Do we plan to enable the gpu process for all users? If not we may want to
> make this an optional data field instead.

Yes, eventually. I think we should not be an optional field.

> ::: toolkit/components/telemetry/TelemetryEnvironment.jsm
> @@ +1072,5 @@
> >        gfxData.features = gfxInfo.getFeatures();
> > +
> > +      if (environmentChange) {
> > +        let oldEnvironment = Cu.cloneInto(this._currentEnvironment, myScope);
> > +        this._onEnvironmentChange("gfx-features-changed", oldEnvironment);
> 
> Lets do this outside of _updateGraphicsFeatures(), keeping with the existing
> code patterns like for `_onPrefChanged()` etc.

Will do.

> ::: toolkit/components/telemetry/docs/data/environment.rst
> @@ +190,5 @@
> >                  status: <string>,
> >                  version: <string>,      // Either "1.0" or "1.1".
> >                },
> > +              "gpu-process" { // Out-of-process compositing ("GPU process") feature
> > +                status: <string>, // "Available" means currently in use
> 
> Please describe this in the docs, if needed in a separate paragraph and
> pointing/linking to the header:
> > It will look something like "gfx.features.gpu-process.status: unused" (or
> > whatever other feature status it has, the full list of which is here:
> > http://searchfox.org/mozilla-central/source/gfx/src/gfxTelemetry.h#15).

This is already in those docs: http://searchfox.org/mozilla-central/rev/f5c9e9a249637c9abd88754c8963ecb3838475cb/toolkit/components/telemetry/docs/data/environment.rst#174
I opened bug 1316073 to add the collected data to the longitudinal dataset. That dataset only contains submissions from 1% of the population, but it's very fast to query and it works well in most of the cases. Please let me know if that doesn't fit your requirements.
Flags: needinfo?(mdoglio)
Hopefully final patch incorporating Georg's feedback. I have also modified the d3d11 and d2d fields in the environment documentation to match the style found elsewhere in the document.
Attachment #8806106 - Attachment is obsolete: true
Attachment #8807748 - Attachment is obsolete: true
Attachment #8808772 - Flags: review?(gfritzsche)
Comment on attachment 8808772 [details] [diff] [review]
0001-Bug-1297790-Add-GPU-process-feature-status-to-the-Te.patch

Review of attachment 8808772 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, r=me on the Telemetry & doc changes.

::: toolkit/components/telemetry/TelemetryEnvironment.jsm
@@ +1069,5 @@
> +   */
> +  _onCompositorProcessAborted: function () {
> +    this._log.trace("_onCompositorProcessAborted");
> +
> +    // Finally trigger the environment change notification.

Nit: The comment doesn't seem to fit ("Finally"?).
Attachment #8808772 - Flags: review?(gfritzsche) → review+
Pushed by gwright@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/364305744e4c
Add GPU process feature status to the Telemetry environment r=dvander,gfritzsche data-review=bsmedberg
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5d06681c5215
Backed out changeset 364305744e4c for test bustage
btw since this burned down most of the tests, can we get a try run before the next try to land this, thanks!
Pushed by gwright@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b5e0bcbb12dd
Add GPU process feature status to the Telemetry environment r=dvander,gfritzsche data-review=bsmedberg
Flags: needinfo?(gwright)
Status: NEW → RESOLVED
Closed: 3 years ago
Keywords: leave-open
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.