Intermittent failure in basic gUM tests | canplaythrough event never fired

VERIFIED FIXED in mozilla20

Status

()

defect
VERIFIED FIXED
7 years ago
7 years ago

People

(Reporter: jsmith, Assigned: roc)

Tracking

({intermittent-failure})

Trunk
mozilla20
All
macOS
Points:
---
Dependency tree / graph
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [getUserMedia] [blocking-gum+])

Attachments

(2 attachments)

Followup on try run from https://bugzilla.mozilla.org/show_bug.cgi?id=781534#c50 - apparently the canplaythrough event never manages to fire in the gum test on OS X 10.7 and 10.8 opt builds. Actual try run results here - https://tbpl.mozilla.org/?tree=Try&rev=f7e037dc5360. 

Apparently it's only happening with basic video test.
Blocks: 781534
Blocker cause we can't pref on gum automation without it, which is a blocker as well.
Whiteboard: [getUserMedia] [blocking-gum+]
Whiteboard: [getUserMedia] [blocking-gum+] → [getUserMedia] [blocking-gum+] [automation-blocked]
Blocks: 814807
No longer blocks: 781534
Maire, can we please get this higher prioritized given that it blocks our automation efforts?
Taking that per discussion during the 2012/12/04 WebRTC meeting.

If this is indeed MacOS specific, I'll receive a Macbook in a couple days. I won't be able to investigate until then.
Assignee: nobody → paul
Status: NEW → ASSIGNED
A bit more on this. The "canplaythrough" event fires when I run only the test test_getUserMedia_basicVideo.html. However, when I run the whole suite, I can reproduce. Now that I can repro, I'm going to dig a little further.
Another update: the root problem is at [1]. We don't go in this if(), and we should: if I log in this function, when we switch from blocked to not blocked, we don't notify the listener, that does not notify the element, and so on. I'm going to try to fix the problem, if this explanation makes sense.

[1]: http://mxr.mozilla.org/mozilla-central/source/content/media/MediaStreamGraph.cpp#813
Paul says he can't reproduce on today's build. he just kicked off a try build with the basic video test prefed on to see what happens:

https://tbpl.mozilla.org/?tree=Try&rev=4a43d50ab3cc
The try push shows that the problem still occurs, but I still can't reproduce locally. I'll try harder.
So bug 802538's try run reveals this is happening generally with any of the mochitests. And it's not just OS X opt builds either.
Summary: canplaythrough event on a media element with gum media stream with video fails to fire on OS X 10.7/10.8 opt builds → canplaythrough event on a media element with gum media stream with video or audio fails to fire intermittently
Here is the current state of understanding of this issue:

What should happen (i.e. how "canplaythrough" is fired):
- Whenever the blocking state of a stream changes (from Blocked to not blocked
or the inverse), in the |MediaStreamGraphImpl::UpdateCurrentTime|, the graph
should call |nsHTMLMediaElement::StreamListener::NotifyBlockingChanged|. This
dispatches an event on the main thread (after all the streams have been
updated), that marks the streams as unblocked in the listener, and call
|UpdateReadyStateForData| on the element, with |NEXT_FRAME_AVAILABLE| as
parameter we it is not blocked anymore, that end up calling
|nsHTMLMediaElement::ChangeReadyState| withe |HAVE_ENOUGH_DATA| as parameter,
that fires "canplaythrough".

What actually happens (i.e. why it is not fired):
- On try, and very rarely on my local machine, and only when running the full
suite (not the individual test), very frequently on MacOS opt, sometimes on
MacOS debug, the MediaStreamGraph experiences a global underrun (the code to
detect that is at the beginning of the |MediaStreamGraphImpl::UpdateCurrentTime|
method). I failed to notice this at the beginning, because LOG macros don't
output on try when doing an opt build.

This moves the |mStateComputedTime| forward to the |nextCurrentTime|, and set
all the streams as blocking from |mStateComputedTime|. The stream becomes
blocked, and everything is dispatched properly to the element.

Next time the MediaStreamGraph thread runs, the stream is logically not blocked
anymore, but should be reported as blocked at |prevCurrentTime|, and we _should_
dispatch an event to the main thread saying that we are not blocked anymore,
since we have advanced the |mCurrentTime| time to compensate the underrun that
made us set the stream as blocked. The problem that seem to happen is that
|prevCurrentTime| is wrong for some reason OR there is a bug in
TimeVarying::GetAt, and this line reports that the stream was not blocked.

>    // Save current blocked status
>    bool wasBlocked = stream->mBlocked.GetAt(prevCurrentTime);

Because we only notify the element via the listener if there is a blocking state
change from |prevCurrentTime| to |nextCurrentTime| (in the while loop) and
because the streams are marked as not blocked during this period, the
|wasBlocked != blocked| condition is never true, and we don't notify the
element.

When outputing the values, |prevCurrentTime| in the function call where we fail
to notify the element is the |nextCurrentTime| of the previous call, so this
seems alright to me. Then it is normal that the stream is not blocked at that
time, because we specifically updated the graph time to bring it to a point
where it is not blocked to compensate for the underrun, so I believe the
underrun compensation code is wrong: it should set the stream as blocked until
the next time we run the graph, so we report the the stream was blocked at
|prevCurrentTime| in the next run.



Hopefully this makes sense. I have a 3g connection this week, so I'll be able to
check my email, please tell me if you need clarifications or something.
I suspect the problem happens like this:
1) The global underrun happens and we set mCurrentTime and mStateComputedTime to the same value. The stream is marked blocking from prevCurrentTime onward and we notify the element.
2) After that, we recompute blocking status for the stream and decide it should not be blocked from mStateComputedTime onward. So we set that in its mBlocking. Note that mBlocking is now false at mCurrentTime!
3) In the next iteration of UpdateCurrentTime, we proceed normally. wasBlocked is set to false since at prevCurrentTime mBlocking is false.

Basically the problem is that we have forgotten our old blocking status; it's been overwritten when mStateComputedTime == mCurrentTime. We should store it separately on each stream instead of utilizing mBlocked.
Thanks Paul, I wouldn't have figured that out without your data.
Attachment #696822 - Flags: review?(rjesup) → review+
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #12)
> Created attachment 696822 [details] [diff] [review]
> fix (hopefully)
> 
> https://tbpl.mozilla.org/?tree=Try&rev=05b9fe9c8865

Good news is that the canplaythrough event is firing correctly in the try run. Bad news is that we're still failing in the automation, which I didn't initially expect to see - when I originally ran this try run a while back, the canplaythrough event + leaks were the main issue. But now we're seeing the following with the basicVideo test:

1. [NEW] played length test is failing is consistently
2. [NEW] 1 of the the tests failed with the stream and media element's time not greater than zero
3. [NEW] Lots and lots of leaks in some of the test runs

Any ideas? What's causing these failures?
Maybe the "convert to TrackUnionStream" patch.
Okay. Sounds like it's safe to land this patch then.

May be worth kicking off another try run without the "convert to trackunionstream" patch though to be sure.
So, from that try run with all gum and peerconnections mochitests enabled, it appears that the errors roc saw  are in fact real (and new).  As validation, I'm running a Try with roc's patch backed out:

https://tbpl.mozilla.org/?tree=Try&rev=1bc99ea0c320
Backout was almost certainly due to other patches backed out with it for scalable fonts (since the errors are in SVG).  However, we still have issues in the gUM mochitests with this patch applied (see the try pushes)
Comparing the results across each try run, here's what I see:

1. This patch definitely causes the regression with the play length test - didn't happen before this patch, but it happens with this patch applied.

2. There's leaks evident in both try runs, but it doesn't appear to be caused by this patch - my hunch is that these leaks are coming from peer connection. Filed bug 825747 on this.

3. There's crashes in both try runs, but I don't think it's from this patch - probably coming from the peer connection tests. I believe there's some bugs on this already.

4. There's a peer connection test timeout in the try run with this patch, which is a known bug - bug 823056.

So why is the play length test failing with this patch?
Note from my push with all desktop platforms that only Mac had the new gum Mochitest problems (the Windows failures appear to be known bugs with patches r+'d and waiting to land by jib and ekr)

Interesting....
Ok, the test that failed on 4/6 Mac runs from my try build was "Played length shall be one - got 0, expected 1".  This is tested in the canPlayThrough callback, so my guess is that this is either a timing hole, or (more likely) that it's incorrect to check that data has played in canPlayThrough; that there are instances (underuns) where data may not have played (yet).  Roc?
Ok, checked roc's Try as well.  All the errors there appear to be variants of the same problem - data not having played when canplaythrough fires.  The leaks are all due to the getUserMedia singleton and audio/video sources, but they appear to only happen occasionally.  I'll look at that
Hmm.  Pushing a new try.  The tree roc used for his try *seems* to have been last updated on Dec 7th, before my gum leak fixes landed.  Hard to tell from Try, so I'll push a new set with jib's and ekr's fixes
(In reply to Randell Jesup [:jesup] from comment #27)
> https://tbpl.mozilla.org/?tree=Try&rev=5ba93a7ba678

Hmm...I don't see a patch on that try run to pref on the mochitests on mozilla central. Am I misreading something or was the try run here setup correctly?
(In reply to Jason Smith [:jsmith] from comment #22)
> Comparing the results across each try run, here's what I see:
> 
> 1. This patch definitely causes the regression with the play length test -
> didn't happen before this patch, but it happens with this patch applied.

Right, see above.

> 2. There's leaks evident in both try runs, but it doesn't appear to be
> caused by this patch - my hunch is that these leaks are coming from peer
> connection. Filed bug 825747 on this.

Roc's leak I think is because an old base rev (see above).  New push should show.
My leak (one Win debug build) appears to be due to a possible race condition where the PeerConnection leak cleanup patch doesn't finish shutting down Signaling before XPCOM shutdown finishes.  Probably need to add synchronization to the shutdown of signaling.

> 3. There's crashes in both try runs, but I don't think it's from this patch
> - probably coming from the peer connection tests. I believe there's some
> bugs on this already.

Yup.  Including some ready-to-land patches in the Try push.  None seem new (though we don't have a patch for the Mac poisoned-write crash; we need a better backtrace of it).

> 4. There's a peer connection test timeout in the try run with this patch,
> which is a known bug - bug 823056.
> 
> So why is the play length test failing with this patch?
(In reply to Jason Smith [:jsmith] from comment #28)
> (In reply to Randell Jesup [:jesup] from comment #27)
> > https://tbpl.mozilla.org/?tree=Try&rev=5ba93a7ba678
> 
> Hmm...I don't see a patch on that try run to pref on the mochitests on
> mozilla central. Am I misreading something or was the try run here setup
> correctly?

Oops

https://tbpl.mozilla.org/?tree=Try&rev=33bd2d9bca6e
https://tbpl.mozilla.org/?tree=Try&rev=8680e3def608

Once more with feeling... and roc's patch too (forgot it was backed out)
(In reply to Randell Jesup [:jesup] from comment #31)
> https://tbpl.mozilla.org/?tree=Try&rev=8680e3def608
> 
> Once more with feeling... and roc's patch too (forgot it was backed out)

This one has Roc's patch but doesn't have the patch to pref on the tests.
Try hides any patches that were in your previous Try.  You'd have to look at the changesets and then click on the Parent links to know exactly what's in a Try.  (That's how I found roc's Try was based in Dec 7th).
Filed bug 825764 for the new? crash from my try.

No leaks, so I think roc's gum leak was an artifact.  There is an intermittent link in windows debug (at least) of sipcc, though it didn't show here.
Still seeing this repeatedly:

ERROR TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_getUserMedia_basicVideoAudio.html | Played length shall be one - got 0, expected 1

And saw this once:

39 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_getUserMedia_basicVideoAudio.html | Stream's time should be greater than zero
40 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_getUserMedia_basicVideoAudio.html | MediaElement time shall be greater than zero
Right.  We're waiting on roc to either do further work on canplaythroughevent, or more likely indicate why the tests are wrong to assume that media has played at that point (or without checking something else like the video element state).

All of those appear to be fallout from the canplaythroughevent change (and all the same issue - the media hasn't played yet due to blocking).  The other two happen once in a while, the first happens on 50-80% of the Mac try runs.
canplaythrough can fire pretty much right away after the stream has been set up and started playing, but going through the event loop means it can be delayed a long time.

So these tests are not valid:

      is(self.mediaElement.ended, false,
        "Media element should not have ended");
(canplaythrough might have been delayed until after the media element ended.)

      is(self.mediaElement.played.length, 1, "Played length shall be one");
This isn't correct either. We'll receive a message from the MSG when the stream unblocks, which when delivered to the main thread queues dispatch of canplaythrough, but then if the main thread is able to fire canplaythrough before the MSG ticks again, we will still not have been notified that any data has actually been played.
Okay. I'll hack together a patch quickly to remove those tests.
Attachment #696900 - Flags: review?(roc)
Comment on attachment 696900 [details] [diff] [review]
Remove Incorrect Gum Tests

Review of attachment 696900 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/media/tests/mochitest/mediaStreamPlayback.js
@@ -75,5 @@
> -        is(self.mediaElement.played.start(0), 0,
> -          "Played start shall be zero");
> -        is(self.mediaElement.played.end(0), self.mediaElement.currentTime,
> -          "End shall be current time");
> -      }

You can restore this test to check "length == 0 || (length == 1 && played.end(0) == currentTime)", but removing is good for now.
Attachment #696900 - Flags: review?(roc) → review+
Sounds good. Let's land each patch now that we know we're good to land after doing some try build checks and fixing the gum tests.
Keywords: checkin-needed
Still need to reland Roc's patch.
Keywords: checkin-needed
Keywords: checkin-needed
Summary: canplaythrough event on a media element with gum media stream with video or audio fails to fire intermittently → Intermittent failure in basic gUM tests | canplaythrough event never fired
https://hg.mozilla.org/mozilla-central/rev/b4f5631d7ce7
https://hg.mozilla.org/mozilla-central/rev/9f493439be84
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
Verified via try run.
Status: RESOLVED → VERIFIED
Whiteboard: [getUserMedia] [blocking-gum+] [automation-blocked] → [getUserMedia] [blocking-gum+]
You need to log in before you can comment on or make changes to this bug.