Closed Bug 655648 Opened 9 years ago Closed 9 years ago

Intermittent test_play_twice.html | Expect at least one playing event

Categories

(Core :: Audio/Video, defect)

x86
Windows XP
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla7

People

(Reporter: philor, Assigned: cpearce)

References

Details

(Keywords: intermittent-failure, Whiteboard: [inbound])

Attachments

(3 files)

Hands up, everyone who thinks it's a coincidence that this is the second randomorange to be filed about WinXP in the three pushes since http://hg.mozilla.org/mozilla-central/rev/615cffa67cfa and http://hg.mozilla.org/mozilla-central/rev/a4bf69cf2f78 landed.

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1304911091.1304911882.14792.gz
Rev3 WINNT 5.1 mozilla-central opt test mochitests-1/5 on 2011/05/08 20:18:11
s: talos-r3-xp-051

64204 INFO TEST-PASS | /tests/content/media/test/test_play_twice.html | 320x240.ogv shouldn't report NETWORK_LOADED
64205 INFO TEST-PASS | /tests/content/media/test/test_play_twice.html | 320x240.ogv checking playback has ended
64206 ERROR TEST-UNEXPECTED-FAIL | /tests/content/media/test/test_play_twice.html | Expect at least one playing event
64207 INFO TEST-PASS | /tests/content/media/test/test_play_twice.html | Should be testing file we think we're testing... - "320x240.ogv" should equal "320x240.ogv"
This bug is probably a regression from landing http://hg.mozilla.org/mozilla-central/pushloghtml?changeset=615cffa67cfa in bug 566779. 

I think the reason we're finishing test_play_twice without firing at least one "playing" event is that we're never reaching readyState HAVE_ENOUGH_DATA while we're potentially playing. We were previously reaching HAVE_ENOUGH_DATA while before metadata was finished loading, but with 615cffa67cfa we stopped that. So this patch changes us to move to HAVE_ENOUGH_DATA when we reach "loadeddata" if the entire resource it loaded.

Greenish on try: http://tbpl.mozilla.org/?tree=Try&rev=b7140e938a35
Assignee: nobody → chris
Status: NEW → ASSIGNED
Attachment #531770 - Flags: review?(roc)
Comment on attachment 531770 [details] [diff] [review]
Patch: Go to readyState HAVE_ENOUGH_DATA if resource is fully loaded at loadeddata state.

Review of attachment 531770 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #531770 - Flags: review?(roc) → review+
(In reply to comment #14)
> mfinkle%mozilla.com
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1305214992.1305216109.
> 5237.gz
> Rev3 WINNT 5.1 mozilla-central opt test mochitests-1/5 on 2011/05/12 08:43:12
> 
> s: talos-r3-xp-024
> 64769 ERROR TEST-UNEXPECTED-FAIL |
> /tests/content/media/test/test_play_twice.html | Expect at least one playing
> event

This is after my patch landing, looks like this bug isn't fixed. :(
I think I figured this out. Before bug 650994 landed, we'd decode all the ogg data up to the next non -1 granulepos during metadata loading. For vorbis this would be an entire page, which could be a second or more of data. But with bug 650994 landed, we'll only decode data in the decode loop, so we'll hit readyState HAVE_METADATA without much data.

Critically (with 650994 landed) for small-shot.ogg, which has 276ms of audio in its first and only page, we won't decode the entire media before switching to HAVE_METADATA. This means that nsBuiltinDecoderStateMachine::HasFutureAudio() is capable of never reporting that there's future playable audio, since it only does so if we have more than LOW_AUDIO_USECS decoded audio, or we've decoded the entire file. LOW_AUDIO_USECS is 300ms, greater than the duration of the file. This means we may never switch to readyState HAVE_FUTURE_DATA, and so never fire a "playing" event.

I have a few ideas for fixes, will experiment...
(In reply to comment #35)
> I think I figured this out. 

I'm convinced this is correct. The following Try push has a 0 LOW_AUDIO_USECS threshold, and it's all green on numerous Win mochitest 1/5 runs:
http://tbpl.mozilla.org/?tree=Try&rev=d8f92969d921

I tried lowering LOW_AUDIO_USECS to 40ms, but we still had failures:
http://tbpl.mozilla.org/?tree=Try&rev=39ff85f5007f

I've since realised that we don't actually want to lower LOW_AUDIO_USECS, since that's used in the "audio pump" logic, so we'd be better to just change HasFutureAudio() to compare the amount of future audio against 0 rather than LOW_AUDIO_USECS.

I'll update my fix and push it to try again just to be sure.

We should backout http://hg.mozilla.org/mozilla-central/rev/46f139db2018 when we land this fix.
I'm also pretty convinced this issue already existed with WebM files, since it never needed to decode large chunks of data to timestamp frames/samples, we just never noticed the problem since we don't have a WebM test file with audio duration less LOW_AUDIO_USECS in our test suite.
(In reply to comment #52)
> (In reply to comment #35)
> > I think I figured this out. 
> 
> I'm convinced this is correct. The following Try push has a 0
> LOW_AUDIO_USECS threshold, and it's all green on numerous Win mochitest 1/5
> runs:
> http://tbpl.mozilla.org/?tree=Try&rev=d8f92969d921
> 
> I tried lowering LOW_AUDIO_USECS to 40ms, but we still had failures:
> http://tbpl.mozilla.org/?tree=Try&rev=39ff85f5007f
> 
> I've since realised that we don't actually want to lower LOW_AUDIO_USECS,
> since that's used in the "audio pump" logic, so we'd be better to just
> change HasFutureAudio() to compare the amount of future audio against 0
> rather than LOW_AUDIO_USECS.

I'm wrong, setting the threshold in HaveFutureAudio() only doesn't work:

http://tbpl.mozilla.org/?tree=Try&rev=9f8cb4ce28c0

This implies that the other place that uses LOW_AUDIO_USECS is where the problem is; the keyframe skipping logic. With LOW_AUDIO_USECS=0, the chance of skipping to the next keyframe goes down... So maybe we're skipping too much now?
Chris, did your investigations lead to any conclusions here?
Ehsan: I'm on holiday at moment, so haven't looked at this since my comment 54. I'm back to work on 6 June.
Disabling the logic which activates skip-to-next-keyframe appears to prevent the failure:

http://tbpl.mozilla.org/?tree=Try&rev=9ec62d7cc0be
(For some reason not all mtest-1 on win32-opt runs are showing up, but I started about 20 and they didn't fail).

My guess is that the skip to keyframe logic is activating, causing us to not have enough decoded frames, causing HaveNextFrameData() to not report us having future data, causing us to not get to readyState > HAVE_CURRENT_DATA, causing us to not fire all state transition events, causing the test failure...

So, why do we activate the keyframe skipping logic...
You see a few more of them on http://dev.philringnalda.com/tbpl/?tree=Try&rev=9ec62d7cc0be since unlike tbpl.m.o I don't believe that all builds and tests on a push start within 12 hours of the push, but according to https://build.mozilla.org/buildapi/self-serve/try/rev/9ec62d7cc0be rather a lot of them started at the same second, which means you repeatedly hit bug 590526, and as soon as Tinderbox heard about another finished result starting the same second, it forgot about the first result.
We're still enqueueing theora frames which have an end time less than the current playback position. (WebM and RAW don't do this already). This could fill up the video queue with late frames, and cause us to put the decoder to sleep, even though we actually need keep decoding. When we come to AdvanceFrame(), we'll empty the queue since it's only full of late frames, and if we never get a timely frame in AdvanceFrame(), we may never advance past HAVE_CURRENT_DATA, and never fire all the playing events, causing the failure.

Seems to be green on try:
Win32: http://tbpl.mozilla.org/?tree=Try&rev=9adf28ae7843
All: http://tbpl.mozilla.org/?tree=Try&rev=4f640b72827c
There's a different random failure on WinXP Debug:
66847 ERROR TEST-UNEXPECTED-FAIL | /tests/content/media/test/test_play_events.html | Last event should be canplaythrough for http://mochi.test:8888/tests/content/media
test/320x240.ogv - got "play", expected "canplaythrough"

Which looks like the test_play_events variant of bug 655630. I had hoped the fix for this bug would fix that, but oh well.
Attachment #537967 - Flags: review?(roc)
Comment on attachment 537967 [details] [diff] [review]
Patch: don't enqueue late Theora frames

Review of attachment 537967 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #537967 - Flags: review?(roc) → review+
(In reply to comment #127)
> You see a few more of them on
> http://dev.philringnalda.com/tbpl/?tree=Try&rev=9ec62d7cc0be since unlike
> tbpl.m.o I don't believe that all builds and tests on a push start within 12
> hours of the push

You should upstream that patch! :)
> Comment on attachment 537967 [details] [diff] [review] [review]
> Patch: don't enqueue late Theora frames

http://hg.mozilla.org/mozilla-central/rev/daf167d2e21e
Chris, should this now be marked as fixed?
No, this is still occurring unfortunately (see comment 140). Coincidentally it looks like I fixed bug 655630 though.
This may have been fixed on m-c by the landing of bug 664354:
http://hg.mozilla.org/mozilla-central/rev/2185b561e5e1 @ 2011-06-14 19:49:51 PDT

If there's no more occurrences over the next few days, we can close this bug.
Let's try to be optimistic, and reopen if this happens again!  Thanks, Chris :)
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Try run for 1d8a948082c4 is complete.
Detailed breakdown of the results available here:
    http://tbpl.mozilla.org/?tree=Try&rev=1d8a948082c4
Results:
    success: 5
    warnings: 1
Total buildrequests: 6
* The file the test is failing on (320x240.ogv) is 266ms duration. We only activate skip-to-keyframe if we've got less than 5s of undecoded data available, so we can never activate skip-to-keyframe, so my comment 54 that the skip-to-keyframe must be activating is wrong.
* We fire the playing etc events in nsHTMLMediaElement::ChangeReadyState if we've got future frames/samples.
* I think the problem could be a race between the decode thread producing frames and the state machine thread consuming frames. The decode thread must be unable to get ahead. Maybe the test machine is under load.
* I've tried updating readyState after decode, it may help: http://tbpl.mozilla.org/?tree=Try&rev=d28af76cfc69 This may not _fix_ the underlying race condition, but may help alleviate it. It's worth a shot at this stage I think!
* I also noticed a timeout in test_timeupdate_small_files (on WinXP, not Linux). I think this was caused the logic in the "our queues full, we should wait" check at the end of DecodeLoop not handling the case where we reached end of stream(s) properly. We should not wait in that loop if we've just hit the end of both audio and video streams. Normally the decode thread would get a notification when the audio or state machine popped the last frame/sample off the queue, and then wake up and exit the loop in this case, but I think the decode thread can end up waiting after the last such notification has come in, causing a timeout.
Attachment #542625 - Flags: review?(kinetik)
Attachment #542625 - Flags: review?(kinetik) → review+
(In reply to comment #247)
> Created attachment 542625 [details] [diff] [review] [review]
> Patch: Update ready state after we decode data

Landed on mozilla-inbound:
http://hg.mozilla.org/integration/mozilla-inbound/rev/fce9b0ca58c3
Whiteboard: [orange] → [orange][inbound]
http://hg.mozilla.org/mozilla-central/rev/fce9b0ca58c3
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla7
Hi guys.
How can this be verified?T
Thanks
It can be considered verified if there are no more instances of the test failure on branches which have the fix.
Whiteboard: [orange][inbound] → [inbound]
You need to log in before you can comment on or make changes to this bug.