Closed
Bug 1297790
Opened 8 years ago
Closed 7 years ago
Telemetry for GPU Process Behavior
Categories
(Core :: Graphics, defect)
Core
Graphics
Tracking
()
RESOLVED
FIXED
People
(Reporter: dvander, Assigned: gw280)
References
Details
(Whiteboard: [gfx-noted])
Attachments
(1 file, 7 obsolete files)
7.91 KB,
patch
|
gfritzsche
:
review+
|
Details | Diff | Splinter Review |
We should have some very basic metrics for the GPU process: (1) How long it takes to start up, from initial launch to the "GPUReady" ACK. (2) How many times we create the GPU Process (right now, just one) (3) How many times it crashes/dies, via an unexpected GPUChild::ActorDestroy. There may be more specific things we want to know inside the GPU process, but nothing comes to mind yet.
Reporter | ||
Comment 1•8 years ago
|
||
And... (4) An entry in the telemetry environment that contains whether the GPU process is available/enabled/disabled on the platform. We do this for D3D11/D3D9/OGL etc by getting the gfxConfig feature state. That's probably what we want here too.
Updated•8 years ago
|
Whiteboard: [gfx-noted]
Reporter | ||
Updated•8 years ago
|
Blocks: e10s-gpu-win
Assignee | ||
Updated•8 years ago
|
Assignee: nobody → gwright
Assignee | ||
Comment 2•8 years ago
|
||
Attachment #8804847 -
Flags: review?(dvander)
Assignee | ||
Comment 3•8 years ago
|
||
Attachment #8804848 -
Flags: review?(dvander)
Assignee | ||
Updated•8 years ago
|
Attachment #8804847 -
Flags: feedback?(benjamin)
Assignee | ||
Updated•8 years ago
|
Attachment #8804848 -
Flags: feedback?(benjamin)
Reporter | ||
Comment 4•8 years ago
|
||
Comment on attachment 8804848 [details] [diff] [review] 0002-Bug-1297790-Add-telemetry-probes-for-GPU-process-lau.patch Review of attachment 8804848 [details] [diff] [review]: ----------------------------------------------------------------- ::: toolkit/components/telemetry/Histograms.json @@ +8634,4 @@ > "releaseChannelCollection": "opt-out", > "description": "Counts of plugin and content process crashes which are reported with a crash dump." > }, > + "SUBPROCESS_LAUNCH_COUNT": { Do we really need a keyed histogram here? Unless someone else has requested this data for another process type, I'd just name it GPU_PROCESS_LAUNCH_COUNT.
Assignee | ||
Comment 5•8 years ago
|
||
Fair comment. I figured I'd follow the existing pattern laid by SUBPROCESS_ABNORMAL_ABORT, but I'm not opposed to just having GPU_PROCESS_LAUNCH_COUNT either.
Reporter | ||
Comment 6•8 years ago
|
||
Comment on attachment 8804847 [details] [diff] [review] 0001-Bug-1297790-Add-GPU-process-feature-state-to-the-Tel.patch Review of attachment 8804847 [details] [diff] [review]: ----------------------------------------------------------------- It would be better to use the built-in feature reporting functionality, for example: http://searchfox.org/mozilla-central/rev/8cf1367dd89cc36ef8f025dfc6af6d5c086838a7/widget/windows/GfxInfo.cpp#1399 InitFeatureObject takes in a name and status and automatically adds some information to Telemetry, including some strings about why stuff failed. You might have to change it so that the nsIGfxInfo part is optional - there's no nsIGfxInfo blocklisting for the GPU process (yet).
Attachment #8804847 -
Flags: review?(dvander)
Attachment #8804847 -
Flags: feedback?(benjamin)
Assignee | ||
Comment 7•8 years ago
|
||
Aha, I didn't notice http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/TelemetryEnvironment.jsm#1070 before; I thought the gfxInfo feature stuff only affected other things like crash reports. Mostly just making this comment so that if anybody else stumbles on it they know about this.
Reporter | ||
Comment 8•8 years ago
|
||
Came up in meeting: We should also have telemetry for when the GPU process fails to get a D3D11 device, since this could imply we're losing acceleration.
Comment 9•8 years ago
|
||
We had a separate request to record the number of content processes we launched, so that we could end up with statistics like total number of content processes that terminated "normally" versus abnormally. If SUBPROCESS_LAUNCH_COUNT is correct for all process types, that's great. But if we're only recording it for the GPU process, then we might as well record it as GPU_PROCESS_LAUNCH_COUNT Related: if you record this as a performance histogram GPU_PROCESS_LAUNCH_MS, you get a performance measure *and* a count in one metric "for free". That's something we're going to retrofit for content processes in bug 1304790 so it might be worth doing now.
Comment 10•8 years ago
|
||
Oh well I didn't read the patch before the comments... With GPU_PROCESS_LAUNCH_TIME_MS you don't need the other one at all. And I'd really encourage you to make this an opt-out metric and monitor it for release users as well, not just prerelease/opt-in users.
Updated•8 years ago
|
Attachment #8804848 -
Flags: feedback?(benjamin)
Assignee | ||
Comment 11•8 years ago
|
||
Uses the InitFeatureObject method instead.
Attachment #8804847 -
Attachment is obsolete: true
Attachment #8805622 -
Flags: review?(dvander)
Attachment #8805622 -
Flags: feedback?(benjamin)
Assignee | ||
Comment 12•8 years ago
|
||
This is now an opt-out probe, and it only adds the GPU_PROCESS_LAUNCH_TIME_MS probe. As for D3D11 stuff re. comment 8, we already have sizeable telemetry for various D3D11 failures, through probes like D3D11_COMPOSITING_FAILURE_ID. We can correlate those with the GPU process feature status to get this information. (On a related note: D3D11_COMPOSITING_FAILURE_ID's alert email is currently set to Benoit Girard, who no longer works here. We should probably update that).
Attachment #8805625 -
Flags: review?(dvander)
Attachment #8805625 -
Flags: feedback?(benjamin)
Reporter | ||
Comment 13•8 years ago
|
||
Comment on attachment 8805625 [details] [diff] [review] 0002-Bug-1297790-Add-telemetry-probes-for-GPU-process-lau.patch Review of attachment 8805625 [details] [diff] [review]: ----------------------------------------------------------------- ::: gfx/ipc/GPUChild.cpp @@ +91,5 @@ > return true; > } > > + bool ignore; > + Telemetry::AccumulateTimeDelta(Telemetry::GPU_PROCESS_LAUNCH_TIME_MS, TimeStamp::ProcessCreation(ignore)); I don't think this is the time you want - we want the time from when we request async launch of the GPU process, here: http://searchfox.org/mozilla-central/rev/445097654c1cc1098ee0171a29c439afe363a588/gfx/ipc/GPUProcessHost.cpp#41 Up to the time we acknowledge the GPU process is ready, here: http://searchfox.org/mozilla-central/rev/445097654c1cc1098ee0171a29c439afe363a588/gfx/ipc/GPUChild.cpp#82 (arguably we could use the time from when the launch starts to when the main IPC channel is open - that could be fine too. though the above is a better estimate of how long the GPU process makes us wait to paint.) ::: toolkit/components/telemetry/Histograms.json @@ +952,5 @@ > "bug_numbers": [1255198], > "description": "This metric is recorded every time a navigator.geolocation.watchPosition() request gets allowed/fulfilled. A false value is recorded if the owner is not visible according to document.isVisible." > }, > + "GPU_PROCESS_LAUNCH_TIME_MS" : { > + "alert_emails": ["george@mozilla.com", "dvander@mozilla.com"], I actually don't have an alias I think, should be danderson@
Attachment #8805625 -
Flags: review?(dvander) → review+
Reporter | ||
Updated•8 years ago
|
Attachment #8805625 -
Flags: review+
Reporter | ||
Updated•8 years ago
|
Attachment #8805622 -
Flags: review?(dvander) → review+
Reporter | ||
Updated•8 years ago
|
Attachment #8804848 -
Flags: review?(dvander)
Assignee | ||
Comment 14•8 years ago
|
||
Carrying forward r+ from David, this patch incorporates the change requested in comment 13, with the exception that we accumulate in both EnsureGPUReady() and RecvInitComplete(). If one is called before the other, the other won't execute due to the early return in both cases. On my machine, at least, RecvInitComplete() executed before EnsureGPUReady().
Attachment #8804848 -
Attachment is obsolete: true
Attachment #8805625 -
Attachment is obsolete: true
Attachment #8805625 -
Flags: feedback?(benjamin)
Attachment #8805667 -
Flags: review+
Comment 15•8 years ago
|
||
Pushed by gwright@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/0a1486cf11b2 Add telemetry probes for GPU process launch/launch-time/abort r=dvander
Assignee | ||
Comment 16•8 years ago
|
||
Leave open for now until I can land the other patch (waiting on sign off for the data collection)
Keywords: leave-open
I had to back this part out for failures like https://treeherder.mozilla.org/logviewer.html#?job_id=38400044&repo=mozilla-inbound https://hg.mozilla.org/integration/mozilla-inbound/rev/2777fabbe62a
Flags: needinfo?(gwright)
Comment 18•8 years ago
|
||
Pushed by philringnalda@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/076a7d60d508 Add telemetry probes for GPU process launch/launch-time/abort r=dvander
Comment 19•8 years ago
|
||
I had to reland it because those were your parent push's sins, not yours.
Flags: needinfo?(gwright)
Comment 20•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/076a7d60d508
Comment 21•8 years ago
|
||
Comment on attachment 8805622 [details] [diff] [review] 0001-Bug-1297790-Add-GPU-process-feature-status-to-the-Te.patch Data review is done on the data docs, not the code itself. In this case this should be documented at https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/environment.html (http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/docs/data/environment.rst) Does this value ever change mid-run? If so do we need to trigger an environment recheck?
Attachment #8805622 -
Flags: feedback?(benjamin) → feedback-
Assignee | ||
Comment 22•8 years ago
|
||
We are anticipating that the GPU process can change status during a session (most likely scenario is when falling back to in-process if we detect it's not working properly), but the telemetry environment should refresh for each ping, which should cover that case.
Attachment #8805622 -
Attachment is obsolete: true
Attachment #8806106 -
Flags: review+
Attachment #8806106 -
Flags: feedback?(benjamin)
Comment 23•8 years ago
|
||
The telemetry environment doesn't refresh and trigger a subsession split unless you tell it to. So if the value can change after startup, you need to explicitly notify the telemetry environment about changes.
Assignee | ||
Comment 24•8 years ago
|
||
OK, this now refreshes the environment and triggers a subsession split when the GPU process is aborted. I don't think we need to do anything for GPU process creation, as that always happens at startup (at least for now).
Attachment #8805667 -
Attachment is obsolete: true
Attachment #8807748 -
Flags: review+
Attachment #8807748 -
Flags: feedback?(benjamin)
Comment 25•8 years ago
|
||
Comment on attachment 8806106 [details] [diff] [review] add docs for feature status in telemetry environment Will this block be present at all if the GPU process is *not* available? Or will it look like "gpu-process: {} ? Pls clarify. data-review=me Mauro can you please help George file any followup bugs about getting this included in longitudinal and other datasets?
Flags: needinfo?(mdoglio)
Attachment #8806106 -
Flags: feedback?(benjamin) → feedback+
Comment 26•8 years ago
|
||
Comment on attachment 8807748 [details] [diff] [review] 0001-Bug-1297790-Add-GPU-process-feature-status-to-the-Te.patch Already did data review, but I'm not going to code-review this. If you want code review for the TelemetryEnvironment bits, please ask gfritzsche.
Attachment #8807748 -
Flags: feedback?(benjamin)
Assignee | ||
Updated•8 years ago
|
Attachment #8807748 -
Flags: review+ → review?(gfritzsche)
Assignee | ||
Comment 27•8 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #25) > Comment on attachment 8806106 [details] [diff] [review] > Will this block be present at all if the GPU process is *not* available? Or > will it look like "gpu-process: {} ? Pls clarify. data-review=me It will look something like "gfx.features.gpu-process.status: unused" (or whatever other feature status it has, the full list of which is here: http://searchfox.org/mozilla-central/source/gfx/src/gfxTelemetry.h#15).
Comment 28•8 years ago
|
||
Comment on attachment 8807748 [details] [diff] [review] 0001-Bug-1297790-Add-GPU-process-feature-status-to-the-Te.patch Review of attachment 8807748 [details] [diff] [review]: ----------------------------------------------------------------- This also needs to update test_TelemetryEnvironment.js (see checkSystemSection()). Do we plan to enable the gpu process for all users? If not we may want to make this an optional data field instead. ::: toolkit/components/telemetry/TelemetryEnvironment.jsm @@ +1072,5 @@ > gfxData.features = gfxInfo.getFeatures(); > + > + if (environmentChange) { > + let oldEnvironment = Cu.cloneInto(this._currentEnvironment, myScope); > + this._onEnvironmentChange("gfx-features-changed", oldEnvironment); Lets do this outside of _updateGraphicsFeatures(), keeping with the existing code patterns like for `_onPrefChanged()` etc. ::: toolkit/components/telemetry/docs/data/environment.rst @@ +190,4 @@ > status: <string>, > version: <string>, // Either "1.0" or "1.1". > }, > + "gpu-process" { // Out-of-process compositing ("GPU process") feature Please stick with existing style, using camelCase and adding the trailing ":". @@ +190,5 @@ > status: <string>, > version: <string>, // Either "1.0" or "1.1". > }, > + "gpu-process" { // Out-of-process compositing ("GPU process") feature > + status: <string>, // "Available" means currently in use Please describe this in the docs, if needed in a separate paragraph and pointing/linking to the header: > It will look something like "gfx.features.gpu-process.status: unused" (or > whatever other feature status it has, the full list of which is here: > http://searchfox.org/mozilla-central/source/gfx/src/gfxTelemetry.h#15).
Attachment #8807748 -
Flags: review?(gfritzsche)
Assignee | ||
Comment 29•8 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #28) > Comment on attachment 8807748 [details] [diff] [review] > 0001-Bug-1297790-Add-GPU-process-feature-status-to-the-Te.patch > > Review of attachment 8807748 [details] [diff] [review]: > ----------------------------------------------------------------- > > This also needs to update test_TelemetryEnvironment.js (see > checkSystemSection()). > > Do we plan to enable the gpu process for all users? If not we may want to > make this an optional data field instead. Yes, eventually. I think we should not be an optional field. > ::: toolkit/components/telemetry/TelemetryEnvironment.jsm > @@ +1072,5 @@ > > gfxData.features = gfxInfo.getFeatures(); > > + > > + if (environmentChange) { > > + let oldEnvironment = Cu.cloneInto(this._currentEnvironment, myScope); > > + this._onEnvironmentChange("gfx-features-changed", oldEnvironment); > > Lets do this outside of _updateGraphicsFeatures(), keeping with the existing > code patterns like for `_onPrefChanged()` etc. Will do. > ::: toolkit/components/telemetry/docs/data/environment.rst > @@ +190,5 @@ > > status: <string>, > > version: <string>, // Either "1.0" or "1.1". > > }, > > + "gpu-process" { // Out-of-process compositing ("GPU process") feature > > + status: <string>, // "Available" means currently in use > > Please describe this in the docs, if needed in a separate paragraph and > pointing/linking to the header: > > It will look something like "gfx.features.gpu-process.status: unused" (or > > whatever other feature status it has, the full list of which is here: > > http://searchfox.org/mozilla-central/source/gfx/src/gfxTelemetry.h#15). This is already in those docs: http://searchfox.org/mozilla-central/rev/f5c9e9a249637c9abd88754c8963ecb3838475cb/toolkit/components/telemetry/docs/data/environment.rst#174
Comment 30•8 years ago
|
||
I opened bug 1316073 to add the collected data to the longitudinal dataset. That dataset only contains submissions from 1% of the population, but it's very fast to query and it works well in most of the cases. Please let me know if that doesn't fit your requirements.
Flags: needinfo?(mdoglio)
Assignee | ||
Comment 31•8 years ago
|
||
Hopefully final patch incorporating Georg's feedback. I have also modified the d3d11 and d2d fields in the environment documentation to match the style found elsewhere in the document.
Attachment #8806106 -
Attachment is obsolete: true
Attachment #8807748 -
Attachment is obsolete: true
Attachment #8808772 -
Flags: review?(gfritzsche)
Comment 32•8 years ago
|
||
Comment on attachment 8808772 [details] [diff] [review] 0001-Bug-1297790-Add-GPU-process-feature-status-to-the-Te.patch Review of attachment 8808772 [details] [diff] [review]: ----------------------------------------------------------------- Thanks, r=me on the Telemetry & doc changes. ::: toolkit/components/telemetry/TelemetryEnvironment.jsm @@ +1069,5 @@ > + */ > + _onCompositorProcessAborted: function () { > + this._log.trace("_onCompositorProcessAborted"); > + > + // Finally trigger the environment change notification. Nit: The comment doesn't seem to fit ("Finally"?).
Attachment #8808772 -
Flags: review?(gfritzsche) → review+
Comment 33•8 years ago
|
||
Pushed by gwright@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/364305744e4c Add GPU process feature status to the Telemetry environment r=dvander,gfritzsche data-review=bsmedberg
Comment 34•8 years ago
|
||
backed out for bustage like https://treeherder.mozilla.org/logviewer.html#?job_id=38877777&repo=mozilla-inbound
Flags: needinfo?(gwright)
Comment 35•8 years ago
|
||
Backout by cbook@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/5d06681c5215 Backed out changeset 364305744e4c for test bustage
Comment 36•8 years ago
|
||
btw since this burned down most of the tests, can we get a try run before the next try to land this, thanks!
Comment 37•7 years ago
|
||
Pushed by gwright@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/b5e0bcbb12dd Add GPU process feature status to the Telemetry environment r=dvander,gfritzsche data-review=bsmedberg
Assignee | ||
Updated•7 years ago
|
Flags: needinfo?(gwright)
Comment 38•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/b5e0bcbb12dd
Reporter | ||
Updated•7 years ago
|
You need to log in
before you can comment on or make changes to this bug.
Description
•