Closed Bug 1403307 Opened 3 years ago Closed 2 years ago

Intermittent dom/media/test/test_mediarecorder_pause_resume_video.html | Last frame should be green

Categories

(Core :: Audio/Video: Recording, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla59
Tracking Status
firefox59 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: bryce)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell disabled][Pernosco])

Attachments

(1 file)

this is failing at a very high failure rate, 135 failures in the last week alone- all on debug, but a variety of platforms.

here is a recent windows10 log:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=134865118&lineNumber=10087

and related text from the log:
03:24:40     INFO - TEST-PASS | dom/media/test/test_mediarecorder_pause_resume_video.html | Media recorder should be paused 
03:24:40     INFO - Drawing color 0,0,255,1
03:24:40     INFO - Buffered messages logged at 03:24:38
03:24:40     INFO - Drawing color 0,255,0,1
03:24:40     INFO - Buffered messages logged at 03:24:39
03:24:40     INFO - Got 'dataavailable' event
03:24:40     INFO - Got 'stop' event
03:24:40     INFO - TEST-PASS | dom/media/test/test_mediarecorder_pause_resume_video.html | Media recorder should be incative after stop 
03:24:40     INFO - TEST-PASS | dom/media/test/test_mediarecorder_pause_resume_video.html | Expected 1 dataavailable event 
03:24:40     INFO - TEST-PASS | dom/media/test/test_mediarecorder_pause_resume_video.html | Should have gotten a data blob 
03:24:40     INFO - TEST-PASS | dom/media/test/test_mediarecorder_pause_resume_video.html | video element should be same size as canvas once metadata is loaded 
03:24:40     INFO - Buffered messages finished
03:24:40     INFO - TEST-UNEXPECTED-FAIL | dom/media/test/test_mediarecorder_pause_resume_video.html | Last frame should be green 
03:24:40     INFO -     startTest/mediaRecorder.onstop/video.onended@dom/media/test/test_mediarecorder_pause_resume_video.html:104:7
03:24:40     INFO -     EventHandlerNonNull*startTest/mediaRecorder.onstop@dom/media/test/test_mediarecorder_pause_resume_video.html:103:5
03:24:40     INFO -     EventHandlerNonNull*startTest@dom/media/test/test_mediarecorder_pause_resume_video.html:81:3
03:24:40     INFO -     @dom/media/test/test_mediarecorder_pause_resume_video.html:125:1
03:24:40     INFO - GECKO(1416) | MEMORY STAT | vsize 1700MB | vsizeMaxContiguous 132540028MB | residentFast 109MB | heapAllocated 21MB
03:24:40     INFO - TEST-OK | dom/media/test/test_mediarecorder_pause_resume_video.html | took 4255ms


:jib, I see you are the triage owner for av:recording, can you help prioritize this bug and get this in the queue for the right person to look at it?
Flags: needinfo?(jib)
Whiteboard: [stockwell needswork]
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f8a337cc1e67
Disable dom/media/test/test_mediarecorder_pause_resume_video.html on debug for frequent failures. r=me, a=test-only
please remember to re-enable this test when working on it.
Keywords: leave-open
Whiteboard: [stockwell needswork] → [stockwell disabled]
Bryce, this is a new test you added in bug 1354457. Please take a look. We should turn off the test temporarily if there's not a quick fix, then we can lower priority if needed.
Assignee: nobody → bvandyk
Rank: 10
Flags: needinfo?(jib) → needinfo?(bvandyk)
Priority: P5 → P1
I suspect this is a regression due to bug 1296531 landing based on the dates involved. I can repro locally, but it's fairly intermittent.

Rough test schedule:
- Draw red
- Record (red)
- Pause
- Draw blue
- Draw green
- Resume (green)
- Stop

The screenshots indicate the last frame of the video is red, which means the recorder hasn't resumed and captured any of the green frames. Don't know why yet, will continue to look.
Flags: needinfo?(bvandyk)
If you adjust the timing of drawing green vs resume() (i.e., increase the nr of rAFs in between, say 72 instead of 62), does that affect the result?
Hard to be sure due to the intermittence (100s of iterations without failures for values I know can fail), but here are some results we me changing the resume value (while leaving draw green at 60):
62 (default): can repro
63: can repro
64: can repro
65+: haven't been able to repro

If I change the value of when we draw green to 30 or 50 I'm unable to repro at all across a range of resume values.

Due to the intermittence I'd talk these with some big grains of salt, but hopefully it's more useful than no data.
Clearly there's a larger than anticipated delay between drawing a frame (it should be captured at the end of requestAnimationFrame()), until it gets into MSG (on the next MSG iteration in NotifyPull()) and further all the way through MediaRecorder (the encoder TaskQueue basically).

Some logging that would be useful would be when a new frame arrives to the TaskQueue vs Resume(); when a new frame is handled by the TaskQueue vs Resume(); and when we forward mLastChunk to the encoder on end-of-stream.
Could you elaborate on what you think is going on with the pause and resume and the delays. I would have thought if we're seeing delays in delivery of frames then shifting the resume to take place later would not alleviate the issue.

Questions to help clarify my understanding and allow me to look at/add appropriate logs: by the time requestAnimationFrame has run draw(), the color of the canvas will be stable, as will the state of the recorder? How does/should this interact with the MSG and drivers? Do we have any guarantees that drivers don't tick during JS execution (I believe not)?

The MSG performs a pull on the data from the canvas (via NotifyPull)? This data is used by SetCurrentFrames (MSG thread) and sent to AppendVideoSegment (MediaRecorder thread), further down the MSG processing pipeline this data is again processed by NotifyQueuedChanges (MSG thread) and sent to AdvanceCurrentTime (MediaRecorder thread), which we use to null queued data when suspended? As above, is there any sync here between the main thread and these? Does the MSG need to hit mainthread to pull form the canvas?

Chrome seem to do some syncing to make sure their recorder and canvas are synced (this pen is an extreme example: https://codepen.io/SingingTree/full/eGZdQX/). I need to investigate what happens if we're doing drawing with workers. Despite that the regression concerns me, I wonder also if this test is entirely sane given what I've learnt since first writing it.
When we draw() in the requestAnimationFrame() callback, there's not really an async step until the capture happens. It happens late during the RefreshDriver Tick on main thread (rAF happens earlier).

However, when we capture a frame (let's call this A), we don't immediately append it to the captured track, [1].

It takes until the next MSG iteration's NotifyPull() to append the new frame, [2] (let's call this B).
(note that in the long term this will change to append immediately -- we need to get rid of the notion of a duration from video chunks first, and preferably some locking from AppendToTrack. But doing this should help your codepen with sync.)

When this AppendToTrack() happens, MediaRecorder receives the new frame immediately (since it uses a direct track listener) and dispatches the frame to VideoTrackEncoder::AppendVideoSegment() on the encoder thread, [3] (let's call this C)

As far as races go, a frame takes this path:
A - MainThread
B - MSG thread
C - Encoder thread

While Resume() takes this path:
MediaRecorder::Resume() - MainThread
TrackEncoder::Resume() - Encoder thread

B is the difference here, and B is dependent on audio callbacks for the current output device (when audio is present in the MSG, or on a timer if there's no audio -- we try to keep iterations to 10ms, [4]). Since you managed to get rid of the intermittent behavior by only changing timing of draw() or Resume(), I assume the race is somehow between B and Resume(), and not between B and Stop().

As the next step I'd look for anomalies in the timing between B and Resume() as they come in on the encoder thread (wherever you can repro). If you see a pattern there it should be simple to reproduce in a gtest and if needed debug that with RR.

VideoTrackEncoder::AdvanceCurrentTime might be interesting to log too, since it also reads mSuspended, and it tells you how long the MSG iterations are.



[1] http://searchfox.org/mozilla-central/rev/40b456626e2d0409b7034768b4d9526fc7235ea4/dom/media/CanvasCaptureMediaStream.cpp#208
[2] http://searchfox.org/mozilla-central/rev/40b456626e2d0409b7034768b4d9526fc7235ea4/dom/media/CanvasCaptureMediaStream.cpp#65
[3] http://searchfox.org/mozilla-central/rev/40b456626e2d0409b7034768b4d9526fc7235ea4/dom/media/encoder/TrackEncoder.cpp#453
[4] http://searchfox.org/mozilla-central/rev/40b456626e2d0409b7034768b4d9526fc7235ea4/dom/media/GraphDriver.h#38
Priority: P1 → P2
All current failures are linux/asan. There was an increase in failure frequency around Dec 18.
Keywords: leave-open
Whiteboard: [stockwell disabled] → [stockwell needswork]
There have been 34 failures in the last week.
This is failing on Linux x64 / asan, with a single exception on linux64-stylo-disabled / opt.

Here is a recent log file:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=153534459&lineNumber=10999

and a snippet with the failure:

[task 2017-12-28T22:41:41.898Z] 22:41:41     INFO - TEST-PASS | dom/media/test/test_mediarecorder_pause_resume_video.html | Media recorder should be incative after stop 
10995
[task 2017-12-28T22:41:41.899Z] 22:41:41     INFO - TEST-PASS | dom/media/test/test_mediarecorder_pause_resume_video.html | Expected 1 dataavailable event 
10996
[task 2017-12-28T22:41:41.900Z] 22:41:41     INFO - TEST-PASS | dom/media/test/test_mediarecorder_pause_resume_video.html | Should have gotten a data blob 
10997
[task 2017-12-28T22:41:41.900Z] 22:41:41     INFO - TEST-PASS | dom/media/test/test_mediarecorder_pause_resume_video.html | video element should be same size as canvas once metadata is loaded 
10998
[task 2017-12-28T22:41:41.903Z] 22:41:41     INFO - Buffered messages finished
10999
[task 2017-12-28T22:41:41.904Z] 22:41:41     INFO - TEST-UNEXPECTED-FAIL | dom/media/test/test_mediarecorder_pause_resume_video.html | Last frame should be green 
11000
[task 2017-12-28T22:41:41.904Z] 22:41:41     INFO -     startTest/mediaRecorder.onstop/video.onended@dom/media/test/test_mediarecorder_pause_resume_video.html:104:7
11001
[task 2017-12-28T22:41:41.904Z] 22:41:41     INFO -     EventHandlerNonNull*startTest/mediaRecorder.onstop@dom/media/test/test_mediarecorder_pause_resume_video.html:103:5
11002
[task 2017-12-28T22:41:41.904Z] 22:41:41     INFO -     EventHandlerNonNull*startTest@dom/media/test/test_mediarecorder_pause_resume_video.html:81:3
11003
[task 2017-12-28T22:41:41.904Z] 22:41:41     INFO -     @dom/media/test/test_mediarecorder_pause_resume_video.html:125:1
11004
[task 2017-12-28T22:41:41.904Z] 22:41:41     INFO - GECKO(1754) | MEMORY STAT | vsize 20973779MB | residentFast 1200MB
11005
[task 2017-12-28T22:41:41.905Z] 22:41:41     INFO - TEST-OK | dom/media/test/test_mediarecorder_pause_resume_video.html | took 4300ms
This has failed 42 times in the last 7 days. It occurred only on Linux x64 platform affecting asan build type.
Failing test suite: only on opt-mochitest-media-e10s.

Link to a recent log: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=153630587&lineNumber=11024

Part of that log:

[task 2017-12-30T23:07:28.139Z] 23:07:28     INFO - Got 'dataavailable' event
11018
[task 2017-12-30T23:07:28.139Z] 23:07:28     INFO - Got 'stop' event
11019
[task 2017-12-30T23:07:28.140Z] 23:07:28     INFO - TEST-PASS | dom/media/test/test_mediarecorder_pause_resume_video.html | Media recorder should be incative after stop 
11020
[task 2017-12-30T23:07:28.141Z] 23:07:28     INFO - TEST-PASS | dom/media/test/test_mediarecorder_pause_resume_video.html | Expected 1 dataavailable event 
11021
[task 2017-12-30T23:07:28.142Z] 23:07:28     INFO - TEST-PASS | dom/media/test/test_mediarecorder_pause_resume_video.html | Should have gotten a data blob 
11022
[task 2017-12-30T23:07:28.143Z] 23:07:28     INFO - TEST-PASS | dom/media/test/test_mediarecorder_pause_resume_video.html | video element should be same size as canvas once metadata is loaded 
11023
[task 2017-12-30T23:07:28.143Z] 23:07:28     INFO - Buffered messages finished
11024
[task 2017-12-30T23:07:28.143Z] 23:07:28     INFO - TEST-UNEXPECTED-FAIL | dom/media/test/test_mediarecorder_pause_resume_video.html | Last frame should be green 
11025
[task 2017-12-30T23:07:28.144Z] 23:07:28     INFO -     startTest/mediaRecorder.onstop/video.onended@dom/media/test/test_mediarecorder_pause_resume_video.html:104:7
11026
[task 2017-12-30T23:07:28.145Z] 23:07:28     INFO -     EventHandlerNonNull*startTest/mediaRecorder.onstop@dom/media/test/test_mediarecorder_pause_resume_video.html:103:5
11027
[task 2017-12-30T23:07:28.145Z] 23:07:28     INFO -     EventHandlerNonNull*startTest@dom/media/test/test_mediarecorder_pause_resume_video.html:81:3
11028
[task 2017-12-30T23:07:28.145Z] 23:07:28     INFO -     @dom/media/test/test_mediarecorder_pause_resume_video.html:125:1
11029
[task 2017-12-30T23:07:28.146Z] 23:07:28     INFO - GECKO(1639) | MEMORY STAT | vsize 20973777MB | residentFast 905MB
11030
[task 2017-12-30T23:07:28.146Z] 23:07:28     INFO - TEST-OK | dom/media/test/test_mediarecorder_pause_resume_video.html | took 4301ms
11031
[task 2017-12-30T23:07:28.147Z] 23:07:28     INFO - TEST-START | dom/media/test/test_mediarecorder_playback_can_repeat.html
11032
[task 2017-12-30T23:07:28.296Z] 23:07:28     INFO - GECKO(1639) | [DEBUG SHUTDOWN] Shutdown: decoder=0x61600044d080 state machine=0x61d000633880
11033
[task 2017-12-30T23:07:28.298Z] 23:07:28     INFO - GECKO(1639) | [DEBUG SHUTDOWN] Enter: state machine=0x6020003f6950 reader=0x60d00001cb50
11034
[task 2017-12-30T23:07:28.299Z] 23:07:28     INFO - GECKO(1639) | [DEBUG SHUTDOWN] Shutdown: reader=0x61e000048c80 shutdown demuxer=0x6020005a3150
11035
[task 2017-12-30T23:07:28.300Z] 23:07:28     INFO - GECKO(1639) | [DEBUG SHUTDOWN] Shutdown: pool=0x60400030ab90 count=1
11036
[task 2017-12-30T23:07:28.301Z] 23:07:28     INFO - GECKO(1639) | [DEBUG SHUTDOWN] operator(): pool=0x60400030ab90 shutdown=true count=0
11037
[task 2017-12-30T23:07:28.301Z] 23:07:28     INFO - GECKO(1639) | [DEBUG SHUTDOWN] TearDownDecoders: reader=0x61e000048c80 shut down audio task queue
11038
[task 2017-12-30T23:07:28.302Z] 23:07:28     INFO - GECKO(1639) | [DEBUG SHUTDOWN] TearDownDecoders: reader=0x61e000048c80 shut down video task queue
11039
[task 2017-12-30T23:07:28.303Z] 23:07:28     INFO - GECKO(1639) | [DEBUG SHUTDOWN] FinishShutdown: state machine=0x61d000633880
11040
[task 2017-12-30T23:07:28.304Z] 23:07:28     INFO - GECKO(1639) | [DEBUG SHUTDOWN] Unregister: decoder=0x61600044d080, count=1
11041
[task 2017-12-30T23:07:28.999Z] 23:07:28     INFO - GECKO(1639) | [DEBUG SHUTDOWN] Register: decoder=0x616000501680, count=2
11042
[task 2017-12-30T23:07:30.111Z] 23:07:30     INFO - GECKO(1639) | MEMORY STAT | vsize 20973770MB | residentFast 889MB
11043
[task 2017-12-30T23:07:30.128Z] 23:07:30     INFO - TEST-OK | dom/media/test/test_mediarecorder_playback_can_repeat.html | took 2428ms
11044
[task 2017-12-30T23:07:30.188Z] 23:07:30     INFO - TEST-START | dom/media/test/test_mediarecorder_record_4ch_audiocontext.html
11045
[task 2017-12-30T23:07:30.254Z] 23:07:30     INFO - GECKO(1639) | [DEBUG SHUTDOWN] Shutdown: decoder=0x616000209480 state machine=0x61d000c61680
11046
[task 2017-12-30T23:07:30.255Z] 23:07:30     INFO - GECKO(1639) | [DEBUG SHUTDOWN] Enter: state machine=0x602000613670 reader=0x60d00025b590
11047
[task 2017-12-30T23:07:30.255Z] 23:07:30     INFO - GECKO(1639) | [DEBUG SHUTDOWN] Shutdown: reader=0x61e000085880 shutdown demuxer=0x602000629c90
11048
[task 2017-12-30T23:07:30.256Z] 23:07:30     INFO - GECKO(1639) | [DEBUG SHUTDOWN] Shutdown: pool=0x6040001fc390 count=1
11049
[task 2017-12-30T23:07:30.256Z] 23:07:30     INFO - GECKO(1639) | [DEBUG SHUTDOWN] operator(): pool=0x6040001fc390 shutdown=true count=0
11050
[task 2017-12-30T23:07:30.256Z] 23:07:30     INFO - GECKO(1639) | [DEBUG SHUTDOWN] TearDownDecoders: reader=0x61e000085880 shut down audio task queue
11051
[task 2017-12-30T23:07:30.257Z] 23:07:30     INFO - GECKO(1639) | [DEBUG SHUTDOWN] TearDownDecoders: reader=0x61e000085880 shut down video task queue
11052
[task 2017-12-30T23:07:30.259Z] 23:07:30     INFO - GECKO(1639) | [DEBUG SHUTDOWN] FinishShutdown: state machine=0x61d000c61680
11053
[task 2017-12-30T23:07:30.260Z] 23:07:30     INFO - GECKO(1639) | [DEBUG SHUTDOWN] Unregister: decoder=0x616000209480, count=1
11054
[task 2017-12-30T23:07:31.478Z] 23:07:31     INFO - GECKO(1639) | MEMORY STAT | vsize 20973795MB | residentFast 901MB
11055
[task 2017-12-30T23:07:31.494Z] 23:07:31     INFO - TEST-OK | dom/media/test/test_mediarecorder_record_4ch_audiocontext.html | took 1302ms
11056
[task 2017-12-30T23:07:31.559Z] 23:07:31     INFO - TEST-START | dom/media/test/test_mediarecorder_record_addtracked_stream.html
Whiteboard: [stockwell unknown] → [stockwell needswork]
jib: do you have any updates on this?
Flags: needinfo?(jib)
Flags: needinfo?(jib)
Comment on attachment 8940916 [details]
Bug 1403307 - Disable dom/media/test/test_mediarecorder_pause_resume_video.html on linux64 as well for frequent failures.

https://reviewboard.mozilla.org/r/211180/#review217084
Attachment #8940916 - Flags: review?(bvandyk) → review+
Pushed by jbruaroey@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ee54a2ff9373
Disable dom/media/test/test_mediarecorder_pause_resume_video.html on linux64 as well for frequent failures. r=bryce
https://hg.mozilla.org/mozilla-central/rev/ee54a2ff9373
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Whiteboard: [stockwell needswork] → [stockwell disabled]
I've debugged a case of this intermittent, and what happens is that the last (green) frame hits the underrun path in [1].

The only way we can append the same frame again is then to wait 1 second, but we reach EOS before that happens.

I need to look at why exactly we hit the underrun path here, and check whether some logic is flawed.



[1] https://searchfox.org/mozilla-central/rev/ce9ff94ffed34dc17ec0bfa406156d489eaa8ee1/dom/media/encoder/TrackEncoder.cpp#769
Whiteboard: [stockwell disabled] → [stockwell disabled][Pernosco]
You need to log in before you can comment on or make changes to this bug.