Closed Bug 1313548 Opened 3 years ago Closed 3 years ago

Intermittent dom/media/test/test_eme_waitingforkey.html | Test timed out.

Categories

(Core :: Audio/Video: Playback, defect, P5)

defect

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox52 --- wontfix
firefox-esr52 --- affected
firefox53 --- wontfix
firefox54 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: jya)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed])

Attachments

(6 files)

Component: Audio/Video → Audio/Video: Playback
:dminor, this is occurring frequently enough to cross our threshold of >30 times/week to make this a priority intermittent.

I am slightly confused here, this is in "audio/video: playback"- are there other tests in the mochitest.ini [0] file which are in the playback component?

I see this failing on linux/osx frequently- for linux it is opt/pgo, and osx it is primarily debug.  this really picked up around January 17th.


looking at the average runtimes, I see that linux32/64 opt/pgo and osx debug are running in the 40+ second range:
Test durations for dom/media/test/test_eme_waitingforkey.html on mozilla-central,mozilla-inbound,autoland between 2017-01-16 and 2017-01-23
linux32/debug-e10s:             43.67 s (42.35 s - 56.39 s over 347 runs)
linux32/debug:                  43.25 s (42.87 s - 55.20 s over 320 runs)
linux32/opt-e10s:               44.91 s (30.63 s - 318.93 s over 372 runs)
linux32/opt:                    45.16 s (30.14 s - 313.89 s over 379 runs)
linux32/pgo-e10s:               31.94 s (30.03 s - 35.46 s over 33 runs)
linux32/pgo:                    31.12 s (29.86 s - 33.02 s over 27 runs)
linux64/asan-e10s:              28.38 s (27.62 s - 30.44 s over 341 runs)
linux64/asan:                   28.15 s (27.30 s - 30.22 s over 337 runs)
linux64/debug-e10s:             42.89 s (41.90 s - 43.48 s over 330 runs)
linux64/debug:                  42.83 s (42.45 s - 43.22 s over 330 runs)
linux64/opt-e10s:               44.52 s (27.64 s - 322.47 s over 396 runs)
linux64/opt:                    40.89 s (28.24 s - 42.19 s over 382 runs)
linux64/pgo-e10s:               44.84 s (27.28 s - 330.68 s over 375 runs)
linux64/pgo:                    46.47 s (27.78 s - 327.30 s over 365 runs)
macosx64/debug-e10s:            41.02 s (26.12 s - 330.07 s over 104 runs)
macosx64/debug:                 28.05 s (26.95 s - 29.43 s over 107 runs)
macosx64/opt-e10s:              24.34 s (21.90 s - 315.88 s over 287 runs)
macosx64/opt:                   25.03 s (9.17 s - 309.60 s over 99 runs)
win32/debug-e10s:               29.08 s (27.48 s - 32.63 s over 95 runs)
win32/debug:                    29.96 s (26.83 s - 35.64 s over 94 runs)
win32/opt-e10s:                 21.56 s (20.75 s - 22.60 s over 262 runs)
win32/opt:                      21.61 s (19.37 s - 22.76 s over 94 runs)
win32/pgo-e10s:                 26.99 s (20.65 s - 322.56 s over 51 runs)
win32/pgo:                      21.04 s (18.76 s - 22.60 s over 59 runs)
win64/debug-e10s:               27.49 s (25.92 s - 31.38 s over 89 runs)
win64/debug:                    28.47 s (26.47 s - 33.97 s over 91 runs)
win64/opt-e10s:                 20.81 s (20.63 s - 21.35 s over 100 runs)
win64/opt:                      20.96 s (20.72 s - 22.44 s over 244 runs)
win64/pgo-e10s:                 20.49 s (20.30 s - 20.59 s over 64 runs)
win64/pgo:                      20.95 s (20.44 s - 40.06 s over 66 runs)


it seems obvious that we would be hitting timed out.  Possibly we split this test up or increase the timeout?


[0] https://dxr.mozilla.org/mozilla-central/source/dom/media/test/mochitest.ini
Flags: needinfo?(dminor)
All tests under dom/media/test are "playback" tests. Tests under dom/media/tests (with an "s") are related to webrtc.

Anthony is probably the best person to talk to about finding someone to look into this.
Flags: needinfo?(dminor) → needinfo?(ajones)
Chris - would you like to look into this or is Jay already taking a look?
Flags: needinfo?(ajones) → needinfo?(cpearce)
this is still failing @ 15 failures/day or 1/8th of the time.  Could we get this fixed this week?
I haven't heard anything from :cpearce in the last 2 weeks here, Anthony, can you find someone else to look at this?  I would be happy to disable this test if there is nobody who has time to look at this.
Flags: needinfo?(ajones)
Jay is looking into this. Don't disable this test, it's ensuring our stuff is working (most of the time).
Flags: needinfo?(cpearce)
Looks like it can be reproduced on Windows with 1 (stressed) core.
Attached file MakeSlow.ps1
This script sets the priority to low and the thread affinity to 1 on Nightly. 

Steps to reproduce the bug with this script:
1. Run the script in PowerShell (sorry).
2. Run ./mach mochitest dom/media/test/test_waitingforkey.html --run-until-failure
3. Wait. 
4. Hope.

I can also sometimes reproduce it on https://people-mozilla.org/~cpearce/mse-clearkey/ through numerous refreshes and judicious employment of finger crossing. 

This bug is temperamental, I can't repro with MOZ_LOGs enabled, which is unhelpful. However:
1. It looks like the 'loadeddata' event isn't fired on the video element when this bug occurs.
2. The last lines of the ClearKey log are 
```
    VideoDecoder::OutputFrame Decoder output ret=0xc00d6d72
    [011FAB78] VideoDecoder::SampleToVideoFrame()
    VideoDecoder::OutputFrame Succeeded.
    Stalled. (<-- from a printf statement at http://searchfox.org/mozilla-central/source/dom/html/HTMLMediaElement.cpp#5424)
```
Right, so the issue becomes a lot more visible (not 100% but a helluva lot closer) when you add a Sleep(10000) at the start of MediaFormatReader::WaitForData (http://searchfox.org/mozilla-central/source/dom/media/MediaFormatReader.cpp#2340). The sleep call also makes it possible to reproduce the bug without giving Firefox minimum CPU time (bye-bye PS script).

Does anyone have any suggestions as to where to go from here? I'm hopelessly lost and I'm worried the dragons will get me :P
Flags: needinfo?(ajones) → needinfo?(jyavenard)
Note: It seems to crop up more when we don't use the cache (refresh with control+shift+r or the devtools panel open and the disable cache pref set).
Flags: needinfo?(jyavenard) → needinfo?(jwwang)
https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/1487705368/mozilla-inbound_yosemite_r7-debug_test-mochitest-media-e10s-bm133-tests1-macosx-build487.txt.gz

13:05:07     INFO - Test timed out. Remaining tests=audio&video tracks, both with all keys-3

The test timed out because "audio&video tracks, both with all keys-3" was not finished. However I can't find useful logs about it. Try to put SimpleTest.requestCompleteLog() to get complete logs so we may know how far "audio&video tracks, both with all keys-3" has progressed.

http://searchfox.org/mozilla-central/rev/b1044cf7c2000c3e75e8181e893236a940c8b6d2/dom/media/test/test_eme_waitingforkey.html#83
It might time out because 'ended' is not received or v.closeSessions().then() is not resolved.
Flags: needinfo?(jwwang)
Attached file output
This is a log from a locally failing test, running until failure. The test that actually timed out was "640x360 then 640x480-19." However, there's not a whole lot of useful information here. 

Note that we do not ever fire the "loadeddata" event but we successfully update all the sessions, receive encrypted/waitingforkeys events and call play.

After calling "play" a "waiting" event is fired, and a playing event is never fired.
Attached file moz-logs.zip
MOZ_LOGS for a failure on https://people-mozilla.org/~cpearce/mse-clearkey/

MOZ_LOG=GMP:5,timestamp,sync,EME:5,MSE:5,MediaSource:5,MediaBuffer:5,MediaFormatReader:5,nsMediaElement:5,nsMediaElementEvents:5,PDM:5,MediaDecoder:5,MediaSample:5
Note: With the test suite different tests fail each time, it doesn't seem so much related to the test but the timing imposed by test_eme_waitingforkey.html
15209 INFO [14:58:08.860] 640x360 then 640x480-19 Updated all sessions, loading complete -> Play
15210 INFO [14:58:08.905] 640x360 then 640x480-19 play
15211 INFO [14:58:08.905] 640x360 then 640x480-19 waiting

The last event emitted by "640x360 then 640x480-19" is 'waiting'.

Can you put |PARALLEL_TESTS=1| and re-run the test with |MOZ_LOG=MediaDecoder:4| so we might find some clues from the logs emitted by MediaDecoder and MediaDecoderStateMachine?
(In reply to jharris from comment #29)
> MOZ_LOGS for a failure on https://people-mozilla.org/~cpearce/mse-clearkey/
What kind of failure is it? Unable to play the video?
Sorry, by failure I meant reproduction of the bug. We don't ever fire the loadeddata event and the video does not play.
2017-02-23 02:51:53.694000 UTC - [MediaPlayback #1]: D/MediaSample Decoder=00D51790 OnAudioNotDecoded aError=2154692617
2017-02-23 02:51:53.735000 UTC - [Main Thread]: D/MediaDecoder Decoder=00D51790 Duration changed to 51.946666
2017-02-23 02:51:54.203000 UTC - [Main Thread]: D/MediaSource MediaSourceDecoder(00D51790)::GetBuffered: ranges=[(0.000000, 51.946666)]
2017-02-23 02:52:38.681000 UTC - [Main Thread]: D/MediaSource MediaSourceDecoder(00D51790)::Shutdown: Shutdown
2017-02-23 02:52:38.681000 UTC - [Main Thread]: D/MediaSource MediaSource(04609F00)::Detach: mDecoder=00D51790 owner=0AF12054
2017-02-23 02:52:38.681000 UTC - [Main Thread]: D/MediaDecoder Decoder=00D51790 ChangeState LOADING => SHUTDOWN

OnAudioNotDecoded aError=2154692617 which is  NS_ERROR_DOM_MEDIA_WAITING_FOR_DATA. The WaitForData promise is never resolved for MDSM so decode can't continue.

You need to look at MediaFormatReader/MediaSource (jya is the one to consult) to find out if there is something wrong to prevent the WaitForData promise from being resolved.
[28927] WARNING: Decoder=111107c00 GetMediaTime=870748 GetClock=882358 mMediaSink=1367ca7c0 state=DECODING mPlayState=3 mSentFirstFrameLoadedEvent=1 IsPlaying=1 mAudioStatus=pending mVideoStatus=pending mDecodedAudioEndTime=1439636 mDecodedVideoEndTime=367033 mAudioCompleted=0 mVideoCompleted=0 mIsPrerolling=0
IsStarted=1 IsPlaying=1, VideoQueue: finished=0 size=1, mVideoFrameEndTime=367033 mHasVideo=1 mVideoSinkEndRequest.Exists()=0 mEndPromiseHolder.IsEmpty()=0
:

Shows that neither the audio nor the video promise got resolved by the MediaFormatReader

going to have a look
Assignee: nobody → jyavenard
thanks for looking into this :jya!
So I'm able to reproduce the issue with good consistency. 

I couldn't fault the behaviour of the MediaFormatReader, the CDMCaps, EME DecoderModule nor the SamplesWaitingForKey. 

All behave as they should. 

The waiting promise is never resolved because from time to time the decoder doesn't return the first sample. So it is assumed the key is never available (we assume that the key is available once the decoder returns a sample or inform that it's input is exhausted)

So I'll check the GMP decoder tomorrow to see what's going on
Looking at the log.

I see the test timing out with a promise not being resolved.

But you can see the code continuing just fine, everything happening as it should... I'm starting to wonder if the way to reproduce the issue consistently (adding a 10s wait) is valid at all, or if there's something else causing the test to timeout even though it hasn't failed at all.
Allright, the idea of adding a 10s wait to a function is bogus, it simply makes the test takes more than 5 minutes (there's lots of videos) and it causes a timeout.

Each video having two tracks, means we wait a minimum of 20s for each test.
We have 23 videos; so that's a total of 460s wait all together.

as we have PARALLEL_TESTS=2; that's 230s to run just those tests. it won't take long to go over the 5 minutes threshold.

Here is with a log showing when a test starts, and a test end:
5 INFO [14:30:55.575] vp9 in mp4-0 Test Started
64 INFO [14:31:17.369] vp9 in mp4-0 Test Ended
69 INFO [14:31:17.398] video-only with 2 keys-1 Test Started
121 INFO [14:31:39.512] video-only with 2 keys-1 Test Ended
126 INFO [14:31:39.531] video-only with 2 keys, CORS-2 Test Started
178 INFO [14:32:01.638] video-only with 2 keys, CORS-2 Test Ended
183 INFO [14:32:01.656] audio&video tracks, both with all keys-3 Test Started
264 INFO [14:32:34.310] audio&video tracks, both with all keys-3 Test Ended
269 INFO [14:32:34.318] audio&video tracks, both with all keys, CORS-4 Test Started
349 INFO [14:33:06.921] audio&video tracks, both with all keys, CORS-4 Test Ended
354 INFO [14:33:06.926] 400x300 audio&video tracks, each with its key-5 Test Started
437 INFO [14:33:39.529] 400x300 audio&video tracks, each with its key-5 Test Ended
442 INFO [14:33:39.557] 640x480@624kbps audio&video tracks, each with its key-6 Test Started
527 INFO [14:34:12.132] 640x480@624kbps audio&video tracks, each with its key-6 Test Ended
532 INFO [14:34:12.139] 640x480@959kbps audio&video tracks, each with its key-7 Test Started
617 INFO [14:34:44.350] 640x480@959kbps audio&video tracks, each with its key-7 Test Ended
622 INFO [14:34:44.369] 640x480 then 400x300, same key (1st) per track-8 Test Started
722 INFO [14:35:16.595] 640x480 then 400x300, same key (1st) per track-8 Test Ended
727 INFO [14:35:16.608] 640x480 then 400x300, same key (2nd) per track-9 Test Started
828 INFO [14:35:49.179] 640x480 then 400x300, same key (2nd) per track-9 Test Ended
833 INFO [14:35:49.205] 640x480 with 1st keys then 400x300 with 2nd keys-10 Test Started
943 INFO [14:36:21.502] 640x480 with 1st keys then 400x300 with 2nd keys-10 Test Ended
948 INFO [14:36:21.510] 400x300 with 1st keys then 640x480 with 2nd keys-11 Test Started
1047 INFO [14:36:54.85]  400x300 with 1st keys then 640x480 with 2nd keys-11 Test Ended
1052 INFO [14:36:54.91]  640x480@959kbps with 1st keys then 640x480@624kbps with 2nd keys-12 Test Started
1152 INFO [14:37:26.740] 640x480@959kbps with 1st keys then 640x480@624kbps with 2nd keys-12 Test Ended
1157 INFO [14:37:26.764] 640x480@624kbps with 1st keys then 640x480@959kbps with 2nd keys-13 Test Started
1257 INFO [14:37:59.398] 640x480@624kbps with 1st keys then 640x480@959kbps with 2nd keys-13 Test Ended
1262 INFO [14:37:59.409] 400x300 with presentation size 533x300-14 Test Started
1345 INFO [14:38:32.44]  400x300 with presentation size 533x300-14 Test Ended
1350 INFO [14:38:32.71]  400x300 as-is then 400x300 presented as 533x300-15 Test Started
1448 INFO [14:39:04.699] 400x300 as-is then 400x300 presented as 533x300-15 Test Ended
1453 INFO [14:39:04.707] 400x225-16 Test Started
1531 INFO [14:39:37.368] 400x225-16 Test Ended
1536 INFO [14:39:37.396] 640x360-17 Test Started
1616 INFO [14:40:10.50]  640x360-17 Test Ended
1621 INFO [14:40:10.59]  400x225 then 640x360-18 Test Started
1720 INFO [14:40:42.689] 400x225 then 640x360-18 Test Ended
1725 INFO [14:40:42.696] 640x360 then 640x480-19 Test Started
1825 INFO [14:41:15.336] 640x360 then 640x480-19 Test Ended
1830 INFO [14:41:15.363] WebM vorbis audio & vp8 video clearkey-20 Test Started
1891 INFO [14:41:38.864] WebM vorbis audio & vp8 video clearkey-20 Test Ended
1896 INFO [14:41:38.871] WebM vorbis audio & vp9 video clearkey-21 Test Started
1957 INFO [14:42:02.294] WebM vorbis audio & vp9 video clearkey-21 Test Ended
1962 INFO [14:42:02.299] WebM vorbis audio & vp9 video clearkey with subsample encryption-22 Test Started
2023 INFO [14:42:17.53]  WebM vorbis audio & vp9 video clearkey with subsample encryption-22 Test Ended

Shows that each individual test completes within 30s, yet we timeout twice (not surprising with 23 videos * 30 = 11.5 minutes)

JW, how can we make it so that the test doesn't time out until a *single* test takes over 5 minutes.

I don't know enough the mochitest structure to know what to do.

I would have thought that it requires a single test to take over 5 minutes, not that the whole thing must take less than 5 minutes.

In the mean time, I've done some cleaning in the code, I'll push it for now
Flags: needinfo?(jwwang)
Note: It may be timing out on the tests legitimately but it's possible to reproduce outside the testing environment, on the MSE clearkey demo
Comment on attachment 8840874 [details]
Bug 1313548: P1. Always resolve waiting promise, when clearing waitforkey.

https://reviewboard.mozilla.org/r/115300/#review116850
Attachment #8840874 - Flags: review?(gsquelart) → review+
Comment on attachment 8840875 [details]
Bug 1313548: P2. Add additional information in dump data.

https://reviewboard.mozilla.org/r/115302/#review116852
Attachment #8840875 - Flags: review?(gsquelart) → review+
Comment on attachment 8840876 [details]
Bug 1313548: P3. Return early of we're not longer waiting for a key.

https://reviewboard.mozilla.org/r/115304/#review116854
Attachment #8840876 - Flags: review?(gsquelart) → review+
(In reply to Jean-Yves Avenard [:jya] from comment #39)
> JW, how can we make it so that the test doesn't time out until a *single*
> test takes over 5 minutes.

http://searchfox.org/mozilla-central/rev/90d1cbb4fd3dc249cdc11fe5c3e0394d22d9c680/dom/media/test/manifest.js#1586
The closet thing you can do is set PARALLEL_TESTS to 1 so whenever a test case times out, it must be the last overlong test to blame. PARALLEL_TESTS=1 also makes it a lot easier to read logs.
Flags: needinfo?(jwwang)
That wasn't my question. So I'm not sure what you're answering.

The timing above we're done with PARALLEL_TEST set to 1.

I've just shown that each individual test in itself takes around 30s to run, yet because there are 23 of them, the full test is marked as timing out.
BUT NO SINGLE TEST TAKES MORE THAN 5 MINUTES
:jya, I see 3 patches and 3 r+, should we go ahead and land this?  We still have a fairly high failure rate.
Flags: needinfo?(jyavenard)
Whiteboard: [stockwell needswork]
We can, but the real fix of this bug will come with bug 1343749
Flags: needinfo?(jyavenard)
Pushed by jyavenard@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/86489d8bd882
P1. Always resolve waiting promise, when clearing waitforkey. r=gerald
https://hg.mozilla.org/integration/autoland/rev/24ee5cc96d2a
P2. Add additional information in dump data. r=gerald
https://hg.mozilla.org/integration/autoland/rev/41a721007e82
P3. Return early of we're not longer waiting for a key. r=gerald
I will assume this fix will make a nice dent in the failure rate until bug 1343749 is landed.
Whiteboard: [stockwell needswork] → [stockwell fixed]
Duplicate of this bug: 1344255
(In reply to Joel Maher ( :jmaher) from comment #55)
> I will assume this fix will make a nice dent in the failure rate until bug
> 1343749 is landed.

Unfortunately, I don't think that will be the case. The patches do fix a potential theoretical problem, but I doubt any of the intermittents here where cause by this. The tests is just made of too many files, it would be quite easy on a slow machine to go over the 300s threshold.
You need to log in before you can comment on or make changes to this bug.