Closed Bug 1388309 Opened 7 years ago Closed 7 years ago

Video decoding is not recovered from GPU process crash.

Categories

(Core :: Audio/Video: Playback, defect, P1)

x86_64
Windows
defect

Tracking

()

VERIFIED FIXED
mozilla57
Tracking Status
firefox-esr52 --- unaffected
firefox55 --- wontfix
firefox56 + verified
firefox57 --- verified

People

(Reporter: kaku, Assigned: kaku)

References

(Blocks 1 open bug)

Details

(Keywords: regression, Whiteboard: [testcoverage] )

Attachments

(1 file, 3 obsolete files)

Steps to reproduce:

* (Optional) Set media.suspend-bkgnd-video.enabled to false in about:config.
* Play accelerated video (e.g. H.264 on YouTube).
* Kill compositor process.

Expected results:

Video keeps playing.

Actual results:

Some error message is shown on the YT player.

Regression window:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=26a0e8b9f27895eb55e21f53e3f086bbf809472a&tochange=5c9b668ee568bd506fe720868e3c018e2b71eb89

Probably be a regression of bug 1338011.
Blocks: 1338011, 1372070
Brief analysis:

When the GPU process crashes, we set GPUProcessManager::mDecodeVideoOnGpuProcess = false here:
http://searchfox.org/mozilla-central/rev/bd39b6170f04afeefc751a23bb04e18bbd10352b/gfx/ipc/GPUProcessManager.cpp#508

And then, the GPUProcessManager will never create PVideoDecoderManager actors again:
http://searchfox.org/mozilla-central/rev/bd39b6170f04afeefc751a23bb04e18bbd10352b/gfx/ipc/GPUProcessManager.cpp#947

But we keep using RemoteDecoderModule which tries to initialize a VidoeDecoderChild on the content process and then fails.
So that, no more video decoding.
Hi Gerald,

Do you think comment 1 is the right root cause?
Flags: needinfo?(gsquelart)
Keywords: regression
Priority: -- → P1
[Tracking Requested - why for this release]:
Video stop playing when GPU crashes.
Whiteboard: [testcoverage]
I'm afraid I don't know that much about this code.

But your analysis in comment 1 makes sense.
It means we should stop creating RemoteVideoDecoder's, e.g.: either remove RemoteDecoderModule from PDMFactory::mCurrentPDMs, or somehow make RemoteDecoderModule::Supports() always return false so it won't be used to create decoders anymore.

Matt, you worked on the GPU process and remote video decoder -- ideas, suggestions?
Flags: needinfo?(gsquelart) → needinfo?(matt.woodrow)
RemoteDecoderModule::CreateVideoDecoder already has a fallback path to the normal PDM that gets taken when the GPU process no longer exists.

Can we just add a way to check for the state where there is a GPU process, the remote decode pref is true, but we still don't allow remote decoding?

VideoDecoderManagerChild::Open should have been called (with a null endpoint) when this happens, and CreateVideoDecoder does a sync task to the same thread, so we should be able to check state there, and gracefully handle failure.

VideoDecoderChild::InitIPDL intentionally pretends that it succeeded if the manager is unavailable, to handle the case where we try create a new decoder in the middle of a GPU process crash. The new 'fake' decoder will return NS_ERROR_DOM_MEDIA_NEED_NEW_DECODER, trigger a recreation, and hopefully the manager will be recreated by then.

We want to distinguish between the manager being unavailable because we've just crashed and are working on reinitializing (where we return true), and recreation being completed but we've just stopped supporting remote video (return false, and then fallback to the other PDM).
Flags: needinfo?(matt.woodrow)
Milan, any ideas here?
Flags: needinfo?(milan)
Matt's suggestion in comment 5 seems reasonable.

Anthony, I'm assuming we don't want to back out bug 1338011 that regressed this?
Flags: needinfo?(milan) → needinfo?(ajones)
(In reply to Matt Woodrow (:mattwoodrow) from comment #5)
> RemoteDecoderModule::CreateVideoDecoder already has a fallback path to the
> normal PDM that gets taken when the GPU process no longer exists.
> 
> Can we just add a way to check for the state where there is a GPU process,
> the remote decode pref is true, but we still don't allow remote decoding?
Do we really want to stop remote decodding in this case? 
I think bug 1338011 wants a software decoder in the GPU process, right?
(In reply to Tzuhao Kuo [:kaku] (PTO on 8/9) from comment #8)
> (In reply to Matt Woodrow (:mattwoodrow) from comment #5)
> > RemoteDecoderModule::CreateVideoDecoder already has a fallback path to the
> > normal PDM that gets taken when the GPU process no longer exists.
> > 
> > Can we just add a way to check for the state where there is a GPU process,
> > the remote decode pref is true, but we still don't allow remote decoding?
> Do we really want to stop remote decodding in this case? 
> I think bug 1338011 wants a software decoder in the GPU process, right?

I don't think so, it doesn't restart the video IPC channel, so we can't have remote decoders.

Running software decoding in the GPU process would work too, but that's basically backing out bug 1338011 and starting again.

I don't have a strong opinion as to which way we go.
(In reply to Tzuhao Kuo [:kaku] (PTO on 8/9) from comment #8)
> Do we really want to stop remote decodding in this case? 
> I think bug 1338011 wants a software decoder in the GPU process, right?

Good point.

I would argue that the main reason we wanted decoding to happen in the GPU process, was to catch OS&driver issues for which we cannot do anything; whereas software decoding is done by our software, and so if there are crashes we should be able to fix them.
Of course, if we can do software decoding in the GPU process, it would be nice too, for better isolation.

In any case, whether or not we disable remote decoding, we will need to disable hardware decoding, otherwise we'll keep using the same drivers that may have just crashed the GPU process!
(So my suggestion in comment 4 was incomplete too, sorry about that.)

So in summary I now see two options:
- `mDecodeVideoOnGpuProcess` should instead be `mHardwareDecodeVideoOnGpuProcess`, and when it's off, the GPU process would still start the video IPC channel, but from now on the remote decoder should only create software-only decoders in the GPU process.
- If mDecodeVideoOnGpuProcess becomes false, we would *both* prevent remote decoding, and only create software-only decoders in the main/content process.

What do you think?
(In reply to Gerald Squelart [:gerald] from comment #10)
> I would argue that the main reason we wanted decoding to happen in the GPU
> process, was to catch OS&driver issues for which we cannot do anything;
> whereas software decoding is done by our software, and so if there are
> crashes we should be able to fix them.
> Of course, if we can do software decoding in the GPU process, it would be
> nice too, for better isolation.
I agree that "catching software decoder crashing" is not a strong reason to have a software decoder in the GPU process. 

But, does "software decode in GPU process" benefit performance? 

I suppose the answer is "not so much" as the most costing part is copying software decoder's result into GPU memory instead of communication between content process and GPU process.
However, I don't know much about how decoder and compositor work together, so I still raise this question.

> So in summary I now see two options:
> - `mDecodeVideoOnGpuProcess` should instead be
> `mHardwareDecodeVideoOnGpuProcess`, and when it's off, the GPU process would
> still start the video IPC channel, but from now on the remote decoder should
> only create software-only decoders in the GPU process.
> - If mDecodeVideoOnGpuProcess becomes false, we would *both* prevent remote
> decoding, and only create software-only decoders in the main/content process.
> 
> What do you think?
At this moment, I think we should have a fix to make the software decoder work on the main/content process and uplift it.
Whether have software decoder in GPU process should be left to another bug.
(In reply to Tzuhao Kuo [:kaku] (PTO on 8/9) from comment #11)
> At this moment, I think we should have a fix to make the software decoder
> work on the main/content process and uplift it.
Will try to work out a patch and look for the review. 
However, if someone who is much more familiar with this code and want to provide a quick fix, please feel free to take over.
Sounds good, thanks.

In case it's quite difficult, you could just remove the mDecodeVideoOnGpuProcess-related code and uplift that to beta as a quick fix; and then less urgently, do a proper rework in Nightly.
Looks like we won't fix 55!
Track 56+ as regression.
(In reply to Sylvestre Ledru [:sylvestre] from comment #14)
> Looks like we won't fix 55!

We're failing to play video in cases where we used to crash, so I don't expect many users will see this issue.
Flags: needinfo?(ajones)
Attachment #8897249 - Flags: review?(matt.woodrow)
Attachment #8897250 - Flags: review?(matt.woodrow)
Attachment #8897251 - Flags: review?(matt.woodrow)
Attachment #8897252 - Flags: review?(matt.woodrow)
Comment on attachment 8897249 [details]
Bug 1388309 P1 - let VideoDecoderManagerChild keep the info of whether decoding video on GPU process or not;

https://reviewboard.mozilla.org/r/168110/#review175678

::: dom/media/ipc/VideoDecoderManagerChild.h:30
(Diff revision 1)
>    static VideoDecoderManagerChild* GetSingleton();
>  
>    // Can be called from any thread.
>    static nsIThread* GetManagerThread();
>    static AbstractThread* GetManagerAbstractThread();
> +  static bool IsDecodeVideoOnGPUProcess();

This is under the comment of 'can be called from any thread', but I think it'd be racy to call it from any other thread.

I'm pretty sure we only touch it on the manager thread, so can you move it up next to GetSingleton?
Can also add an assertion into the impl to make sure we don't call it on other threads.

::: dom/media/ipc/VideoDecoderManagerChild.cpp:74
(Diff revision 1)
>                               []() {
>                                 if (sDecoderManager &&
>                                     sDecoderManager->CanSend()) {
>                                   sDecoderManager->Close();
>                                   sDecoderManager = nullptr;
> +                                 sDecodeVideoOnGPUProcess = true;

false?

::: dom/media/ipc/VideoDecoderManagerChild.cpp:154
(Diff revision 1)
> +  sDecodeVideoOnGPUProcess = aDecodeVideoOnGPUProcess;
> +
>    // Make sure we always dispatch everything in sRecreateTasks, even if we
>    // fail since this is as close to being recreated as we will ever be.
>    sDecoderManager = nullptr;
>    if (aEndpoint.IsValid()) {

Can't we just use this if condition to set sDecodeVideoOnGPUProcess?

If it's valid (and binds succeeds), then we're decoding on the GPU process, if it's invalid, then the parent process must have decided not to decode on the GPU process.

I think we can skip all of patches 2 and 3 if we do that.
Attachment #8897250 - Flags: review?(matt.woodrow)
Attachment #8897251 - Flags: review?(matt.woodrow)
Comment on attachment 8897252 [details]
Bug 1388309 - let VideoDecoderChild::InitIPDL() return false if we don't want to decode video on GPU process;

https://reviewboard.mozilla.org/r/168204/#review175680
Attachment #8897252 - Flags: review?(matt.woodrow) → review+
Comment on attachment 8897249 [details]
Bug 1388309 P1 - let VideoDecoderManagerChild keep the info of whether decoding video on GPU process or not;

https://reviewboard.mozilla.org/r/168110/#review175678

> This is under the comment of 'can be called from any thread', but I think it'd be racy to call it from any other thread.
> 
> I'm pretty sure we only touch it on the manager thread, so can you move it up next to GetSingleton?
> Can also add an assertion into the impl to make sure we don't call it on other threads.

Done.

> false?

Right, thanks.

> Can't we just use this if condition to set sDecodeVideoOnGPUProcess?
> 
> If it's valid (and binds succeeds), then we're decoding on the GPU process, if it's invalid, then the parent process must have decided not to decode on the GPU process.
> 
> I think we can skip all of patches 2 and 3 if we do that.

All right, your suggestion makes life easier, thanks!
Attachment #8897250 - Attachment is obsolete: true
Attachment #8897251 - Attachment is obsolete: true
Comment on attachment 8897252 [details]
Bug 1388309 - let VideoDecoderChild::InitIPDL() return false if we don't want to decode video on GPU process;

https://reviewboard.mozilla.org/r/168204/#review175754

Sorry, I just realised that sDecoderManager and sDecodeVideoOnGPUProcess now contain basically the same information.

Can we just return false if (!manager), and return true if (!manager->CanSend()) ?
See Also: → 1369097
(In reply to Matt Woodrow (:mattwoodrow) from comment #26)
> Comment on attachment 8897252 [details]
> Bug 1388309 P2 - let VideoDecoderChild::InitIPDL() return false if we don't
> want to decode video on GPU process;
> 
> https://reviewboard.mozilla.org/r/168204/#review175754
> 
> Sorry, I just realised that sDecoderManager and sDecodeVideoOnGPUProcess now
> contain basically the same information.
> 
> Can we just return false if (!manager), and return true if
> (!manager->CanSend()) ?

Quite right, thanks for reminding. 

VideoDecoderManagerChild::ActorDestroy() only makes the VideoDecoderManagerChild::mCanSend = false and does not delete the instance.

Will update the patch.
Attachment #8897249 - Attachment is obsolete: true
Attachment #8897249 - Flags: review?(matt.woodrow)
(In reply to Tzuhao Kuo [:kaku] from comment #28)
> Comment on attachment 8897252 [details]
> Bug 1388309 - let VideoDecoderChild::InitIPDL() return false if we don't
> want to decode video on GPU process;
> 
> Review request updated; see interdiff:
> https://reviewboard.mozilla.org/r/168204/diff/2-3/

Not sure if you want to review it again, and don't know how to re-request review from MozReview, so set the NI request here.
Please have a look, thanks!
Flags: needinfo?(matt.woodrow)
Looks great, r=me!
Flags: needinfo?(matt.woodrow)
Thanks for the review!
Pushed by tkuo@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7be3f7ba4cda
let VideoDecoderChild::InitIPDL() return false if we don't want to decode video on GPU process; r=mattwoodrow
https://hg.mozilla.org/mozilla-central/rev/7be3f7ba4cda
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Hi Kaku,
Do you consider to uplift this to Beta56?
Flags: needinfo?(kaku)
Assignee: nobody → kaku
(In reply to Gerry Chang [:gchang] from comment #34)
> Hi Kaku,
> Do you consider to uplift this to Beta56?

Yes, we do, but we haven't get a way to automatically test it.
Keep the ni and will see how to sort out.
If figuring out a way to add automatic test takes a while, perhaps we could do manual test with SV's help for Beta and adding test to nightly.
(In reply to Blake Wu [:bwu][:blakewu] from comment #36)
> If figuring out a way to add automatic test takes a while, perhaps we could
> do manual test with SV's help for Beta ....
How could we have SV's help on Beta before the patch has been uplifted?
Flags: needinfo?(bwu)
Florin,
Can you help us test it before we request a uplift to Beta?
Flags: qe-verify+
Flags: needinfo?(florin.mezei)
Flags: needinfo?(bwu)
(In reply to Blake Wu [:bwu][:blakewu] from comment #38)
> Florin,
> Can you help us test it before we request a uplift to Beta?

Someone from Andrei's Release QA team will have a look.
Flags: needinfo?(florin.mezei) → needinfo?(andrei.vaida)
I managed to reproduce the issue on Firefox 57.0a1 (2017-08-08), under Windows 10x64.
The issue is no longer reproducible on Firefox 57.0a1 (2017-09-12), under Windows 10x64.
Status: RESOLVED → VERIFIED
Flags: qe-verify+
Flags: needinfo?(andrei.vaida)
(In reply to Mihai Boldan, QA [:mboldan] from comment #40)
> I managed to reproduce the issue on Firefox 57.0a1 (2017-08-08), under
> Windows 10x64.
> The issue is no longer reproducible on Firefox 57.0a1 (2017-09-12), under
> Windows 10x64.

Hello, may I ask for a verification on a 32bit build also? according to bug 1369097 comment 11.
Flags: needinfo?(mihai.boldan)
I managed to test this issue also on Firefox 57.0a1 (2017-09-12)(32-bit), under Windows 10x86 and under Windows 10x64. Error is no longer displayed. Video continues to play after killing GPU process.
Please let me know if I can help any further with this bug.
Flags: needinfo?(mihai.boldan)
Can we please get an uplift request on this ASAP?
Comment on attachment 8897252 [details]
Bug 1388309 - let VideoDecoderChild::InitIPDL() return false if we don't want to decode video on GPU process;

Approval Request Comment
[Feature/Bug causing the regression]: bug 1338011
[User impact if declined]: video play won't be resumed if GPU process crashes.
[Is this code covered by automated tests?]: no
[Has the fix been verified in Nightly?]: yes
[Needs manual test from QE? If yes, steps to reproduce]: done, comment 40, comment 42
[List of other uplifts needed for the feature/fix]: no
[Is the change risky?]: no
[Why is the change risky/not risky?]: The patch only add a case to return false in VideoDecoderChild::InitIPDL(). The false return value triggers the media playback to try other platform decoder module. It's not risky because the try-other-platform-decoder-module mechanism has been in the code base for a very long time and works well.
[String changes made/needed]: no
Flags: needinfo?(kaku)
Attachment #8897252 - Flags: approval-mozilla-beta?
Comment on attachment 8897252 [details]
Bug 1388309 - let VideoDecoderChild::InitIPDL() return false if we don't want to decode video on GPU process;

Better recovery from GPU process crashes sounds good. Fix has been verified in nightly. Let's uplift this for beta 12.
Attachment #8897252 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Mihai, could you also verify this fix on beta 12?
Flags: needinfo?(mihai.boldan)
Video continues to play correctly after ending the compositor process on Firefox 56.0b12 x32 and on Firefox 56.0b12 x64 builds.
Tests were performed under Windows 10x64.
Flags: needinfo?(mihai.boldan)
Seems that it is causing bug 1374231

We should backout that 56 asap.
Ryan, wdyt?
Flags: needinfo?(ryanvm)
I don't think this should be backout, it's to prevent crash after all. No crash trace involved this part of the code.

Additionally, we don't care that the GPU process crash, as it will default back to software decoding which is exactly what we want.
Flags: needinfo?(ryanvm)
You need to log in before you can comment on or make changes to this bug.