Closed Bug 1393392 Opened 2 years ago Closed 2 years ago

Deblacklisting HW decoding and add Telemetry to see the crashing status.

Categories

(Core :: Audio/Video: Playback, enhancement, P3)

x86_64
Windows
enhancement

Tracking

()

RESOLVED FIXED
mozilla57
Tracking Status
firefox57 --- fixed

People

(Reporter: kaku, Assigned: kaku, NeedInfo)

References

(Blocks 1 open bug)

Details

Attachments

(5 files, 6 obsolete files)

59 bytes, text/x-review-board-request
gerald
: review+
mattwoodrow
: review+
Details
59 bytes, text/x-review-board-request
francois
: review+
gerald
: review+
Details
59 bytes, text/x-review-board-request
gerald
: review+
Details
59 bytes, text/x-review-board-request
mattwoodrow
: review+
Details
59 bytes, text/x-review-board-request
gerald
: review+
mattwoodrow
: review+
Details
Spawn from bug 1372070 comment 4.

We're going to remove the HW decoding blacklist.

In this bug, we check if a driver would have been blacklisted but do not blacklist it and then report its runtime status through Telemetry.

For each time a decoder is created, it will end up to one of the following cases:
(1) enabling this previously-blacklisted driver (e.g., HW decoding) and didn't crash.
(2) enabling this previously-blacklisted driver (e.g., HW decoding) and crashed.
(3) re-blacklisting this previously-blacklisted driver (e.g., SW decoding) and didn't crash
(4) re-blacklisting this previously-blacklisted driver (e.g., SW decoding) and crashed.

The data will help us to see if a driver should be kept blacklisted or not.
Blocks: 1372070
Assignee: nobody → kaku
Status: NEW → ASSIGNED
Priority: -- → P3
Attachment #8900773 - Flags: review?(matt.woodrow)
Attachment #8900773 - Flags: review?(gsquelart)
Attachment #8900774 - Flags: review?(gsquelart)
Attachment #8900774 - Flags: review?(francois)
Attachment #8900775 - Flags: review?(matt.woodrow)
Attachment #8900775 - Flags: review?(gsquelart)
Attachment #8900776 - Flags: review?(matt.woodrow)
Attachment #8900776 - Flags: review?(gsquelart)
Attachment #8900777 - Flags: review?(matt.woodrow)
Attachment #8900777 - Flags: review?(gsquelart)
Attachment #8900778 - Flags: review?(matt.woodrow)
Attachment #8900778 - Flags: review?(gsquelart)
Attachment #8900779 - Flags: review?(matt.woodrow)
Attachment #8900779 - Flags: review?(gsquelart)
Comment on attachment 8901083 [details]
Bug 1393392 P0 - fix compiling error;

https://reviewboard.mozilla.org/r/172540/#review177854
Attachment #8901083 - Flags: review?(gsquelart) → review+
Comment on attachment 8900773 [details]
Bug 1393392 P1 - a preference to ignore hw-decoding blacklist in gpu process;

https://reviewboard.mozilla.org/r/172084/#review177886

::: commit-message-d1c70:1
(Diff revision 2)
> +Bug 1393392 P1 - deblacklisting;

A few more words would be nice. Maybe something like:
"media.wmf.deblacklisting-for-telemetry"=true ignores video-decoding blacklisting.
Attachment #8900773 - Flags: review?(gsquelart) → review+
Comment on attachment 8900774 [details]
Bug 1393392 P2 - add Telemetry probe;

https://reviewboard.mozilla.org/r/172086/#review177862

::: toolkit/components/telemetry/Histograms.json:13548
(Diff revision 2)
> +    "expires_in_version": "60",
> +    "releaseChannelCollection": "opt-out",
> +    "keyed": true,
> +    "kind": "enumerated",
> +    "n_values":  10,
> +    "bug_numbers": [1372070],

Bug number should be 1393392.

::: toolkit/components/telemetry/Histograms.json:13549
(Diff revision 2)
> +    "releaseChannelCollection": "opt-out",
> +    "keyed": true,
> +    "kind": "enumerated",
> +    "n_values":  10,
> +    "bug_numbers": [1372070],
> +    "description": "The runtime status of previously-blacklisted driver: 0:HW-decoding and GPU process didn't crash, 1:HW-decoding and GPU process crashed, 2:SW-decoding and GPU process didn't crash, 3:SW-decoding and GPU process crashed."

Please describe what the keys are.
Attachment #8900774 - Flags: review?(gsquelart) → review+
Comment on attachment 8900775 [details]
Bug 1393392 P3 - a help funtion to record telemetry data;

https://reviewboard.mozilla.org/r/172088/#review177864

::: commit-message-dd000:1
(Diff revision 2)
> +Bug 1393392 P3 - a help funtion to record telemetry data;

'help funtion' -> 'helper function'

::: gfx/ipc/GPUProcessManager.cpp:67
(Diff revision 2)
> +  const nsCString& blacklistedDLL = !aD3D11BlacklistedDriver.IsEmpty()
> +                                    ? aD3D11BlacklistedDriver
> +                                    : aD3D9BlacklistedDriver;
> +
> +  if (!blacklistedDLL.IsEmpty()) {
> +    const uint32_t type = [isHWDecodingOnGPUProcess, isGPUProcessCrashed]() {

Why a lambda here? It feels a bit awkward; please consider a separate function, or just use `?:`.

::: gfx/ipc/GPUProcessManager.cpp:71
(Diff revision 2)
> +        }
> +        else {

Should be: '} else {' on one line. (more below)
Did you run `./mach clang-format`?
Attachment #8900775 - Flags: review?(gsquelart) → review+
Comment on attachment 8900776 [details]
Bug 1393392 P4 - keep DXVA driver information in GPUProcessManager;

https://reviewboard.mozilla.org/r/172090/#review177870

LGTM, but don't trust me to the IPC review (not enough experience), hopefully Matt should know best about that.

::: commit-message-9a182:4
(Diff revision 2)
> +Bug 1393392 P4 - keep DXVA driver information in GPUProcessManager;
> +
> +We need to keep the information in GPUProcessManager because we still
> +need the information aftern GPU has crashed, so that's why we cannot

'aftern' -> 'after'

::: dom/media/ipc/VideoDecoderManagerParent.cpp:9
(Diff revision 2)
>   * License, v. 2.0. If a copy of the MPL was not distributed with this
>   * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
>  #include "VideoDecoderManagerParent.h"
>  #include "VideoDecoderParent.h"
>  #include "base/thread.h"
> +#include "mozilla/gfx/GPUParent.h"      // for GPUParent

I don't think you need this comment here. :-)
Attachment #8900776 - Flags: review?(gsquelart) → review+
Comment on attachment 8900777 [details]
Bug 1393392 P5 - report telemetry while GPU process crashing;

https://reviewboard.mozilla.org/r/172092/#review177894

::: commit-message-a344f:1
(Diff revision 2)
> +Bug 1393392 P5 - report telemetry while GPU process crashing;

'... when GPU process crashes'
Attachment #8900777 - Flags: review?(gsquelart) → review+
Comment on attachment 8900778 [details]
Bug 1393392 P6 - keep DXVA driver information in VideoDecoderManagerChild;

https://reviewboard.mozilla.org/r/172094/#review177898

I'd prefer Matt to take the review lead on this one. Just one suggestion:

::: dom/media/ipc/PVideoDecoderManager.ipdl:26
(Diff revision 2)
>  
>    sync Readback(SurfaceDescriptorGPUVideo sd) returns (SurfaceDescriptor aResult);
>  
>    async DeallocateSurfaceDescriptorGPUVideo(SurfaceDescriptorGPUVideo sd);
> +
> +  sync ReadBlacklistedDrivers() returns (nsCString aD3D11Driver, nsCString aD3D9Driver);

Would it be possible to have an async message instead?
Attachment #8900778 - Flags: review?(gsquelart)
Comment on attachment 8901084 [details]
Bug 1393392 P7 - export the VideoDecoderChild::mCanSend information;

https://reviewboard.mozilla.org/r/172542/#review177900
Attachment #8901084 - Flags: review?(gsquelart) → review+
Comment on attachment 8900779 [details]
Bug 1393392 P8 - report telemetry while a decoder being normally closed;

https://reviewboard.mozilla.org/r/172096/#review177902

::: commit-message-25f44:1
(Diff revision 2)
> +Bug 1393392 P8 - report telemetry while a decoder being normally closed;

while...being -> when...is
Attachment #8900779 - Flags: review?(gsquelart) → review+
Comment on attachment 8900778 [details]
Bug 1393392 P6 - keep DXVA driver information in VideoDecoderManagerChild;

https://reviewboard.mozilla.org/r/172094/#review177898

> Would it be possible to have an async message instead?

I have tried to implement it in an async way, see https://reviewboard.mozilla.org/r/172094/diff/1/. 
The implementation send an async IPC call from VideoDecoderManagerParent::AllocPVideoDecoderParent() to VideoDecoderManagerChild and save the information as static variables in VideoDecoderManagerChild. 
In this way, we write the static variables in sVideoDecoderChildThread and read them in MediaFormatReader's TaskQueue, which might race. 
To prevent racing, we can read those static variables by dispatching a sync task from MediaFormatReader's TaskQueue to sVideoDecoderChildThread.

So, we have sync operations in both implementations, and I chose the current one because the sync operation is hidden by the IPC mechanism which leads to less code and we can invoke the sync operation in an already-existing sync operation, which I feel more comfortable:P

Maybe Matt can give us a better insight on this issue!
Comment on attachment 8900774 [details]
Bug 1393392 P2 - add Telemetry probe;

https://reviewboard.mozilla.org/r/172086/#review178162

::: toolkit/components/telemetry/Histograms.json:13549
(Diff revision 2)
> +    "releaseChannelCollection": "opt-out",
> +    "keyed": true,
> +    "kind": "enumerated",
> +    "n_values":  10,
> +    "bug_numbers": [1372070],
> +    "description": "The runtime status of previously-blacklisted driver: 0:HW-decoding and GPU process didn't crash, 1:HW-decoding and GPU process crashed, 2:SW-decoding and GPU process didn't crash, 3:SW-decoding and GPU process crashed."

When is this probe sent? When the GPU process exits?
Attachment #8900774 - Flags: review?(francois) → review-
Comment on attachment 8900773 [details]
Bug 1393392 P1 - a preference to ignore hw-decoding blacklist in gpu process;

https://reviewboard.mozilla.org/r/172084/#review178324

::: dom/media/platforms/wmf/WMFVideoMFTManager.cpp:409
(Diff revision 2)
>      nsACString* failureReason = &mFailureReason;
>      nsCString secondFailureReason;
>      if (mBackend == LayersBackend::LAYERS_D3D11 &&
>        gfxPrefs::PDMWMFAllowD3D11() && IsWin8OrLater()) {
>        const nsCString& blacklistedDLL = FindD3D11BlacklistedDLL();
> -      if (!blacklistedDLL.IsEmpty()) {
> +      if (!gfxPrefs::PDMWMFDeblacklistingForTelemetry() && !blacklistedDLL.IsEmpty()) {

Should we only disable the blacklist when we're actually in the GPU process?

I know we're not planning on shipping with this pref enabled, but it still seems safer to only do it when we can recover.
Attachment #8900773 - Flags: review?(matt.woodrow) → review+
Comment on attachment 8900775 [details]
Bug 1393392 P3 - a help funtion to record telemetry data;

https://reviewboard.mozilla.org/r/172088/#review178326

r=me with Gerald's comments fixed.
Attachment #8900775 - Flags: review?(matt.woodrow) → review+
Comment on attachment 8900776 [details]
Bug 1393392 P4 - keep DXVA driver information in GPUProcessManager;

https://reviewboard.mozilla.org/r/172090/#review178328

It looks like we submit one telemetry entry for each successful decoder shutdown, but when we crash we submit only a single entry (even though there could be multiple videos playing).
Won't this skew the results to look like we crash less than we do (especially for the multiple video case, which seems likely to be crashier).

I also don't love the idea of adding video decoding specific things to PGPU/GPUProcessManager, since they'd ideally be only for generic GPU process code.

Can we instead make the sharing of the driver version part of PVideoDecoder, so that each time we initialize a decoder, we make sure the child process end (VideoDecoderChild) knows the driver version (and if we've skipped the blacklist for it).
Then we can handle both cases (normal shutdown, and crashing) from VideoDecoderChild::ActorDestroy (AbnormalShutdown gets passed for the crash), and we can minimize how many files need to know about this.
(In reply to Matt Woodrow (:mattwoodrow) from comment #30)
> Comment on attachment 8900776 [details]
> Bug 1393392 P4 - keep DXVA driver information in GPUProcessManager;
> 
> https://reviewboard.mozilla.org/r/172090/#review178328
> 
> It looks like we submit one telemetry entry for each successful decoder
> shutdown, but when we crash we submit only a single entry (even though there
> could be multiple videos playing).
> Won't this skew the results to look like we crash less than we do
> (especially for the multiple video case, which seems likely to be crashier).
Good point and thanks! I think this matters a lot.

> I also don't love the idea of adding video decoding specific things to
> PGPU/GPUProcessManager, since they'd ideally be only for generic GPU process
> code.
> 
> Can we instead make the sharing of the driver version part of PVideoDecoder,
> so that each time we initialize a decoder, we make sure the child process
> end (VideoDecoderChild) knows the driver version (and if we've skipped the
> blacklist for it).
> Then we can handle both cases (normal shutdown, and crashing) from
> VideoDecoderChild::ActorDestroy (AbnormalShutdown gets passed for the
> crash), and we can minimize how many files need to know about this.
This is good and I somehow implemented it in patch 6 (https://reviewboard.mozilla.org/r/172094/diff/2/), although I instead pass and save the information in VideoDecoderManagerChild.

However, in this way, we're not able to detect GPU process crash while we're using a software decoder.

I think that when we de-blacklist a driver and enable HW-decoding once but encounters into a GPU crash, we fallback to do SW-decoding.  In this case, we should keep monitoring GPU crashes so that we have the idea that if the GPU crashes are highly related to the previously-blacklisted driver or not (although we are not able to confirm...).

So, do you think it makes sense to keep monitoring GPU crashes while falling back to SW-decoding?
Flags: needinfo?(matt.woodrow)
Yeah, it doesn't seem like there's much we can do to confirm in that case.

Also, if the GPU process is still crashing after we've disabled HW decoding (since the problem is with something else), then the only solution is to disable the whole GPU process. Once the GPU process is disabled, it no longer matters what the blacklisting state for HW decoding is, since we'll always be software.
Flags: needinfo?(matt.woodrow)
Comment on attachment 8900773 [details]
Bug 1393392 P1 - a preference to ignore hw-decoding blacklist in gpu process;

https://reviewboard.mozilla.org/r/172084/#review178324

> Should we only disable the blacklist when we're actually in the GPU process?
> 
> I know we're not planning on shipping with this pref enabled, but it still seems safer to only do it when we can recover.

Yes, we should. 
Will then consider both `XRE_IsGPUProcess()` and `media.wmf.deblacklisting-for-telemetry`.
Comment on attachment 8900774 [details]
Bug 1393392 P2 - add Telemetry probe;

https://reviewboard.mozilla.org/r/172086/#review178162

> When is this probe sent? When the GPU process exits?

The telemetry data is submitted when a remote HW video decoder is shutdown.
Will also update in the next version's commit message.
Attachment #8900775 - Attachment is obsolete: true
Attachment #8900776 - Attachment is obsolete: true
Attachment #8900776 - Flags: review?(matt.woodrow)
Attachment #8900777 - Attachment is obsolete: true
Attachment #8900777 - Flags: review?(matt.woodrow)
Attachment #8900778 - Attachment is obsolete: true
Attachment #8900778 - Flags: review?(matt.woodrow)
Attachment #8901084 - Attachment is obsolete: true
Attachment #8901084 - Flags: review?(matt.woodrow)
Attachment #8900779 - Attachment is obsolete: true
Attachment #8900779 - Flags: review?(matt.woodrow)
Comment on attachment 8900774 [details]
Bug 1393392 P2 - add Telemetry probe;

https://reviewboard.mozilla.org/r/172086/#review179122

datareview+

::: toolkit/components/telemetry/Histograms.json:13549
(Diff revision 3)
> +    "releaseChannelCollection": "opt-out",
> +    "keyed": true,
> +    "kind": "enumerated",
> +    "n_values":  10,
> +    "bug_numbers": [1393392],
> +    "description": "The runtime status of previously-blacklisted driver: 0:remote HW-decoding and GPU process didn't crash, 1:remote HW-decoding and GPU process crashed. The histogram is keyed by the blacklisted driver name."

Let's include in the description a few words about when the probe is sent:

"Runtime status, sent when a remote HW video decoder is shutdown, of a previously-blacklisted driver:..."
Attachment #8900774 - Flags: review?(francois) → review+
Comment on attachment 8902188 [details]
Bug 1393392 P4 - report telemetry when we shutdown a remote HW decoder;

https://reviewboard.mozilla.org/r/173620/#review179222
Attachment #8902188 - Flags: review?(gsquelart) → review+
Comment on attachment 8902187 [details]
Bug 1393392 P3 - keep DXVA driver information in VideoDecoderChild;

https://reviewboard.mozilla.org/r/173618/#review179292
Attachment #8902187 - Flags: review?(matt.woodrow) → review+
Comment on attachment 8902188 [details]
Bug 1393392 P4 - report telemetry when we shutdown a remote HW decoder;

https://reviewboard.mozilla.org/r/173620/#review179294
Attachment #8902188 - Flags: review?(matt.woodrow) → review+
Try looks good, thanks for the review!
Pushed by tkuo@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/829acc100ac7
P0 - fix compiling error; r=gerald
https://hg.mozilla.org/integration/autoland/rev/3a078caef5c5
P1 - a preference to ignore hw-decoding blacklist in gpu process; r=gerald,mattwoodrow
https://hg.mozilla.org/integration/autoland/rev/c81e6970c3ff
P2 - add Telemetry probe; r=francois,gerald
https://hg.mozilla.org/integration/autoland/rev/82d96d1fe80b
P3 - keep DXVA driver information in VideoDecoderChild; r=mattwoodrow
https://hg.mozilla.org/integration/autoland/rev/a4e0ecc18cae
P4 - report telemetry when we shutdown a remote HW decoder; r=gerald,mattwoodrow
Kaku,
Is there any patch worth uplifting to Beta to get more data?
Flags: needinfo?(kaku)
(In reply to Blake Wu [:bwu][:blakewu] from comment #50)
> Kaku,
> Is there any patch worth uplifting to Beta to get more data?

Yes, but bug 1388309 must precede.
Flags: needinfo?(kaku)
Depends on: 1403499
Do we have the telemetry results?
Flags: needinfo?(kaku)
Add more info about how to read telemetry. 
For "Start End", "0   1" means "0:remote HW-decoding and GPU process didn't crash" and "1   2" means "remote 1:HW-decoding and GPU process crashed".
For 57, I think we should still have a blacklist to block those drivers which crashes a lot observed in this telemetry.

Anthony and Milan,
What do you think?
Flags: needinfo?(milan)
Flags: needinfo?(ajones)
(In reply to Blake Wu [:bwu][:blakewu] from comment #55)
> For 57, I think we should still have a blacklist to block those drivers
> which crashes a lot observed in this telemetry.
> 
> Anthony and Milan,
> What do you think?

What specifically are you proposing? I can live with a recoverable crash happening in 1 in 10,000 sessions.
Flags: needinfo?(ajones)
(In reply to Anthony Jones (:kentuckyfriedtakahe, :k17e) from comment #56)
> (In reply to Blake Wu [:bwu][:blakewu] from comment #55)
> > For 57, I think we should still have a blacklist to block those drivers
> > which crashes a lot observed in this telemetry.
> > 
> > Anthony and Milan,
> > What do you think?
> 
> What specifically are you proposing? I can live with a recoverable crash
I am not sure we should remove the whole blacklist in 57. Maybe we should. But at least we should check the telemetry results and update our blacklist.  
> happening in 1 in 10,000 sessions.
Then my question is what is the threshold of crash ratio we think user can live with with recoverable mechanism? 

Benjamin, 
Per discussion, Could you help analyze the telemetry results to see if there is any high crash rate case?
Flags: needinfo?(bechen)
Blake, from the graph result you provided, the crash rate doesn't even register in the percentage scale.

While millions of users with those devices enjoy the benefits (lower CPU usage, better battery life, smoother user experience).

So why spoil it for those millions when the crash rate is 168/11800000: 0.0014% !!

It's a recoverable crash, and the recovery is almost transparent for the user which was the entire point of having a GPU process.

Having a blacklist once again defeat the purpose.
I am with you. For the low crash rate, like 0.0014%, we don't need to blacklist it. I am not insane. :-) 
Currently I would like to have an overall sanity check if we should remove the blacklist or resize the blacklist. Benjamin will check the telemetry results. If every driver has quite low crash rate, I vote for removing the whole blacklist as well. Let's see more numbers and decide the next step.
I just check the whole drivers, the results looks great.

The highest crash rate is only 0.08%~0.14%.

atidxx32.dll(8.17.10.519)
total : 78.97k (99.86%)
crash : 114 (0.14%)

atidxx32.dll(8.17.10.525)
total : 1.25M (99.92%)
crash : 942 (0.08%)

igd10iumd32.dll(20.19.15.4311)
total : 28.41k (99.92%)
crash : 24 (0.08%)
Flags: needinfo?(bechen)
Benjamin,
Thanks! Good to know. 
I think we also need to check if there are other reasons we decided to block some driver instead of crash.
You need to log in before you can comment on or make changes to this bug.