Open Bug 1397539 Opened 3 years ago Updated 1 year ago

H264 rendering fails after several calls

Categories

(Core :: Audio/Video: GMP, defect, P2)

49 Branch
defect

Tracking

()

ASSIGNED

People

(Reporter: drno, Assigned: hankpeng)

References

(Depends on 1 open bug, )

Details

Attachments

(1 file)

After making about 10-20 calls on http://mozilla.github.io/webrtc-landing/pc_test.html with "Require H264" turned on remote video no longer gets rendered.

I repeatedly start and stop the call without reloading the page.

After several starts (10-20?) the call connects according to about:webrtc and the ICE stats indicate that data gets exchanged, but both remote video elements never start rendering anything.

When in this "blank white" state I turn off "Require H264" I get proper video rendering (with VP8) again. But as soon as I turn "Require H264" back on rendering still fails.

Reloading the page does not solve the problem. Closing the tab and trying it in another tab does not solve the problem.

This is with OpenH264 1.6, but also with the upcoming 1.7.1 releases.
I create a test page to repro this: https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html

- Open the link
- give access to cam and click the check box for future permission

On my first try it took 27 connections until I start to see white boxes for both remote videos (the bigger videos).

about:webrtc suggests that no RTP gets exchanged (neither send or received) once this stage has been reached.
Mass change P1->P2 to align with new Mozilla triage process
Priority: P1 → P2
See Also: → 1284285
See Also: → 1408258
we reproduced the issue using the demo : https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html
the log has been uploaded.
We have a couple of different issues relating to H264 now. Byron could you investigate these? I'm guessing they all stem from the same root cause.
Flags: needinfo?(docfaraday)
Not anytime soon. I might be able to give this a look next week?
Flags: needinfo?(docfaraday)
That's fine. Can we keep the needinfo so it doesn't fall off the charts?
Flags: needinfo?(docfaraday)
This seems to happen when we start encoding frames before GMP is done de-initing from the previous test... looking further, but likely a bug in either the GMP code, or the wrapper code we use to interface with GMP.
Yeah, this definitely happens when the GMPParent has not been closed yet when we try to start a new call; the encoder/decoder has been torn down, but the logic in the GMPParent thinks the h264 process is still open and doesn't re-init it. This is almost certainly a bug in either the GMP code proper, or maybe the openh264 stuff.
Pushing further back, this seems to happen when ActorDestroy has not been called on both encoder parents and both decoder parents before starting the new call. This does not appear to be an issue of not waiting long enough to start the new call; either all of the expected ActorDestroy calls are made in short order, or they are simply not made regardless of how long we wait.
Flags: needinfo?(docfaraday)
I see the following code at the end of Encode_w (the encode function that is called on the worker thread on the plugin process) in the openh264 code:


    if (!has_frame) {
      // This frame must be destroyed on the main thread.
      TrySyncRunOnMainThread (WrapTask (
                                   this,
                                   &OpenH264VideoEncoder::DestroyInputFrame_m,
                                   inputImage));
      return;
    }

    // Synchronously send this back to the main thread for delivery.
    TrySyncRunOnMainThread (WrapTask (
                                   this,
                                   &OpenH264VideoEncoder::Encode_m,
                                   inputImage,
                                   &encoded,
encoded_type));

We are sync dispatching to main when we're done encoding a frame. Here's the implementation of EncodingComplete:

  virtual void EncodingComplete() {
    shutting_down = true;

    // Release the reference to the external objects, because it is no longer safe to call them
    host_     = nullptr;
    callback_ = nullptr;
    TearDownEncoder();

    Release();
}

TearDownEncoder is implemented thusly:

   void TearDownEncoder() {
     // Stop the worker thread first
     if (worker_thread_) {
       worker_thread_->Join();
       worker_thread_ = nullptr;
     }

     // Destroy OpenH264 encoder
     if (encoder_) {
       WelsDestroySVCEncoder(encoder_);
       encoder_ = nullptr;
     }
   }


This code is called on main. If we get a call to EncodingComplete while we are encoding a frame, "worker_thread_->Join()" will never return, because the worker thread is trying to sync dispatch back to main.
I should note that, when we deadlock, the IPC logging looks like this:

[time: 1533313793767228][49950->49965] [PGMPVideoEncoderParent] Sending  PGMPVideoEncoder::Msg_Encode
[time: 1533313793767359][49950->49965] [PGMPVideoEncoderParent] Sending  PGMPVideoEncoder::Msg_EncodingComplete
[time: 1533313793767397][49965<-49950] [PGMPVideoEncoderChild] Received  PGMPVideoEncoder::Msg_Encode
[time: 1533313793769270][49965<-49950] [PGMPVideoEncoderChild] Received  PGMPVideoEncoder::Msg_EncodingComplete

This seems to line up with the openh264 bug I described in comment 12.
Blocks: 1486988
Assignee: nobody → hankpeng
This bug is supposed to be fixed by the new release v1.8.1.
Status: NEW → ASSIGNED
Depends on: 1533001
You need to log in before you can comment on or make changes to this bug.