Open Bug 1405083 Opened 2 years ago Updated 4 days ago

Intermittent dom/media/tests/mochitest/test_getUserMedia_basicScreenshare.html | Test timed out.

Categories

(Core :: WebRTC: Audio/Video, defect, P3)

People

(Reporter: intermittent-bug-filer, Unassigned, NeedInfo)

References

(Regression)

Details

(Keywords: intermittent-failure, leave-open, Whiteboard: [stockwell disabled])

Attachments

(6 files)

Seems to have started when bug 1380346 landed.
Depends on: 1380346
The screenshot from the failure in comment 0 is pretty telling -- the canvas in fullscreen is all white.
Rank: 25
Component: Audio/Video → WebRTC: Audio/Video
Priority: P5 → P3
It's odd that we only fail on Windows 10 (all configs of it), with 50 failures in 3 days.

:pehrsons, do you have any more insight into why this might be happening? Would you be able to help get this bug assigned to the right person?
Flags: needinfo?(apehrson)
Whiteboard: [stockwell needswork]
Yeah, I'm trying to reach out to graphics about this.

I haven't gotten any replies so far, but I'll ni? :bas here to see where we can take it.

Bas, could you point me in the right direction for who might know about a fullscreen 2d canvas on windows being white? (see comment 3)
Flags: needinfo?(apehrson) → needinfo?(bas)
:bas, with almost 100 failures last week this is really frequent. Can you help :pehrsons out?
I'm seeing this on macOS now, but only when I build with jya's icecream instructions at: https://github.com/jyavenard/mozilla-icecream

If I do a default build (no .mozconfig at all), I don't see this failure.
I see the same behavior as :mjf. Building opt with icecc on my macOS (10.12.6) laptop, I get a single pattern of 4 colored blocks that does not change. When I build on the same machine without a mozconfig, the test progresses through the patterns. I have not tried building opt without icecc.
Opt build without icecc works as expected.
And, not surprisingly, the debug build using icecc hangs with the unchanging pattern just like the opt build using icecc.  Just fyi.
Odd. On a build that's broken, what do you see in a regular screenshare demo like https://mozilla.github.io/webrtc-landing/gum_test.html (click "Screen")?
Flags: needinfo?(na-g)
Flags: needinfo?(mfroman)
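
For readers unfamiliar with that demo: it essentially just requests a screen capture and attaches the stream to a video element. A minimal sketch of that flow, using the mediaSource constraint Firefox accepted at the time (the demo's actual code and constraints may differ):

```js
// Illustrative only -- not the actual gum_test.html code.
async function startScreenShare(videoElement) {
  // Firefox-style screen capture request of that era; getDisplayMedia()
  // is the standard equivalent in current browsers.
  const stream = await navigator.mediaDevices.getUserMedia({
    video: { mediaSource: "screen" },
  });
  // Render the captured screen. On a broken build this frame never updates
  // apart from the cursor overlay.
  videoElement.srcObject = stream;
  await videoElement.play();
  return stream;
}
```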
When I run the screenshare demo on the broken build, I see a static image of what was on the screen when the demo started, though cursor movement is shown. FPS figures are updating. However, moving windows around on the screen does _not_ show up.

When I run the screenshare demo in the working build I see an image of the screen with the "tunnel effect" (each of the screenshare windows is showing a smaller version of the screen).  FPS figures are updating, and window movement on the screen is shown as expected.

I'll grab a couple screen shots for clarity.
Flags: needinfo?(na-g)
Flags: needinfo?(mfroman)
I see the same thing as Michael. In the broken build I get a static screen including the mouse cursor at the time of capture. There is also a dynamically updating cursor.
Comment on attachment 8946308 [details]
NonWorkingBuildScreenShareDemo.jpeg

Oh, that's not good. And also weird.

Judging from the mouse cursor sizes, is there one static cursor and one moving cursor, both part of the capture?
Flags: needinfo?(mfroman)
This smells like some sort of compiler/linker or mixed-SDK issue... Do all the SDKs used match exactly? (They'd better.) Do all the compiler options match exactly, and the include paths? Are any of the include paths system paths (which could differ)? Or can we verify they're all the same includes somehow, including those they pull in?
(In reply to Andreas Pehrson [:pehrsons] from comment #35)
> Comment on attachment 8946308 [details]
> NonWorkingBuildScreenShareDemo.jpeg
> 
> Oh, that's not good. And also weird.
> 
> Judging from the mouse cursor sizes is there one static and one moving, both
> part of the capture?

Yes, exactly: one static and one moving mouse cursor. The moving cursor seems to be the only thing updating in the capture.
Flags: needinfo?(mfroman)
So, I see a couple of cases recently where we are getting gray (236,237,231) instead of green, and there's a (gray) popup from the OS about an error unlocking a keychain:

https://treeherder.mozilla.org/logviewer.html#?job_id=173827645&repo=autoland&lineNumber=9020
https://taskcluster-artifacts.net/Jknb9ASvQumwsedBe_gw7A/0/public/test_info/mozilla-test-fail-screenshot_euMT_O.png

https://treeherder.mozilla.org/logviewer.html#?job_id=174470240&repo=mozilla-inbound&lineNumber=14115
https://taskcluster-artifacts.net/dOHE9xf-QhCrASDLQq4HkQ/0/public/test_info/mozilla-test-fail-screenshot_qk39jp.png


The other failures I see seem to be getting blue (80,126,183) instead of red, with no keychain popup. This shade of blue is close to (but not quite) the color of the desktop on these machines (screenshot from a different test):

https://taskcluster-artifacts.net/Dg0SDRWCTYa8roInRj1PYA/0/public/test_info/mozilla-test-fail-screenshot_iN231Z.png
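
For context on what "gray instead of green" means here: the test paints a known color, captures the screen, and compares sampled pixels from the captured frames against that color. A rough sketch of that kind of check (helper names and the tolerance are illustrative, not the actual mochitest helpers):

```js
// Illustrative pixel check, not the actual test helper.
function pixelRoughlyMatches(video, expected, tolerance = 16) {
  // Draw the latest captured frame into a scratch canvas.
  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext("2d");
  ctx.drawImage(video, 0, 0);
  // Sample the center pixel and compare channel-by-channel.
  const x = Math.floor(canvas.width / 2);
  const y = Math.floor(canvas.height / 2);
  const [r, g, b] = ctx.getImageData(x, y, 1, 1).data;
  return (
    Math.abs(r - expected[0]) <= tolerance &&
    Math.abs(g - expected[1]) <= tolerance &&
    Math.abs(b - expected[2]) <= tolerance
  );
}
```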
Ping regarding keychain popups possibly throwing the screenshare tests off.
Flags: needinfo?(jlund)
(In reply to Byron Campen [:bwc] from comment #50)
> Ping regarding keychain popups possibly throwing the screenshare tests off.

ah, just seeing this needinfo now.

dhouse, last week you may remember one of our loan issues where the keychain kept asking for a password. Buildduty were able to loan the machine out without that prompt by re-imaging and resetting the keychain. However, it turns out this might be the very problem in automation. You can see the first two treeherder links in comment 49 failing to pass the screencapture test, and the associated screenshot with the keychain prompt. Are you the right person to look at this?
Flags: needinfo?(jlund) → needinfo?(dhouse)
Jake, do you remember dealing with the keychain prompt for tests on OS X before? I can dig into testing it and checking the configuration, but you might have some context here.
Flags: needinfo?(jwatkins)
There is a comment in puppet about the keychain for the build user:
```
exec {
    # whenever the user password changes, we need to delete the keychain, otherwise
    # it will prompt on login
    'kill-builder-keychain':
        command     => "/bin/rm -rf ${home}/Library/Keychains/login.keychain",
        notify      => Exec['reboot_semaphore'],
        refreshonly => true;
}
```
(https://hg.mozilla.org/build/puppet/file/tip/modules/users/manifests/builder/account.pp#l114)

So the keychain for the user running the tasks *should* be set up with the user's current password. If the password was changed without using puppet, or the keychain got out of sync with the password in some other way such that this exec wasn't triggered, that would explain the keychain warning/popup.
I looked over the treeherder results, and two mac minis are recorded as having this problem repeat over the past months (#408 appeared once, and #321 appears on Jan 31, but with a different failure):

#415, #0032
```
t-yosemite-r7-415     Wed May 2, 05:07:45
t-yosemite-r7-415     Sun Apr 29, 21:50:13
t-yosemite-r7-415     Sun Apr 29, 04:26:21
t-yosemite-r7-0032    Wed Apr 25, 16:18:31
t-yosemite-r7-0032    Mon Apr 23, 20:04:13
t-yosemite-r7-415     Mon Apr 23, 16:42:24
t-yosemite-r7-415     Fri Apr 20, 07:49:53
t-yosemite-r7-0032    Thu Apr 19, 00:54:52
t-yosemite-r7-415     Tue Apr 17, 20:57:41
t-yosemite-r7-408     Mon Apr 16, 16:42:01
t-yosemite-r7-0032    Sun Apr 15, 20:49:20
t-yosemite-r7-415     Fri Apr 13, 11:52:33
t-yosemite-r7-415     Thu Apr 12, 08:36:21
t-yosemite-r7-415     Fri Apr 6, 12:12:18
win10 Sat Mar 31, 05:37:12
win10 Sat Mar 30..
t-yosemite-r7-415   Thu Mar 29, 10:04:25
win10 Mar 29
t-yosemite-r7-415   Mon Mar 26, 05:07:31
win10 Mar 25
win10 Mar 25
win10 Mar 24
415     Sat Mar 24, 07:20:36
415     Fri Mar 23, 16:31:34
win10 Mar 23
win10 Mar 23
415     Thu Mar 22, 17:06:12
win10 Mar 17
415     Tue Mar 13, 08:28:55
win10 Feb 21,24(4x),25,27
415     Tue Feb 20, 04:02:45
win10 Feb 18
415     Sat Feb 17, 13:37:21
win10 Feb 15,16,17
415     Thu Feb 15, 12:26:25
win10 14,15(2x)
415     Tue Feb 13, 06:33:10
415     Mon Feb 12, 17:37:15
win10 Feb 11
415     Sat Feb 10, 10:53:02
win10 Feb 6,7,8
415     Mon Feb 5, 22:20:48
linux   Jan 31
321     Wed Jan 31, 11:16:46 ( tab crashed. no prompt in screenshot )
linux   Jan 31
win10   Jan 31
415     Tue Jan 23, 19:54:33
win10 Jan 3,7,9(2x),16,19
```
Flags: needinfo?(dhouse)
Depends on: 1459250
I've created bug 1459250 for rebuilding those two macs.
Flags: needinfo?(jwatkins)
(In reply to OrangeFactor Robot from comment #60)
> 35 failures in 632 pushes (0.055 failures/push) were associated with this
[...]
> https://treeherder.mozilla.org/intermittent-failures.html#/
> bugdetails?bug=1405083&startday=2018-05-21&endday=2018-05-27&tree=trunk

"Sun Apr 29, 2018 to Tue May 29, 2018 UTC. Intermittent dom/media/tests/mochitest/test_getUserMedia_basicScreenshare.html | Test timed out.
8 total failures"

Again the failures are all on t-yosemite-r7-415 

(In reply to Dave House [:dhouse] from comment #57)
> I've created bug 1459250 for rebuilding those two macs.

415 was rebuilt on May 4th.
The next comment from OrangeFactor will likely be affected by https://bugzilla.mozilla.org/show_bug.cgi?id=1270217#c22.
According to bug 1270217 comment 22, updating the default value for MACOSX_DEPLOYMENT_TARGET from 10.7 to 10.9 makes this permafail. I'm going to disable this test for now, because without the patch in bug 1270217 local builds are broken for people building on Mac.
Pushed by jwatt@jwatt.org:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0ee6b755ab2e
Disable test_getUserMedia_basicScreenshare.html for now. r=permafail
(In reply to Michael Froman [:mjf] from comment #25)
> I'm seeing this on macOS now, but only when I build with jya's icecream
> instructions at: https://github.com/jyavenard/mozilla-icecream

Possibly due to the use of -mmacosx-version-min=10.11
Keywords: leave-open
Summary: Intermittent dom/media/tests/mochitest/test_getUserMedia_basicScreenshare.html | Test timed out. → [test disabled] Intermittent dom/media/tests/mochitest/test_getUserMedia_basicScreenshare.html | Test timed out.
I've spent several hours today looking for a lead that a non-media person might be able to find. No luck. Dan, would you be able to take a look at this, or suggest someone better? The test now fails consistently when run locally. The screen capture seems to be stuck at the initial captured frame and no longer seems to update (other than the mouse cursor). I don't see anything useful in the system log or in the console output from the test run (though I may be missing something).
Flags: needinfo?(dminor)
At the time we did the last update from upstream webrtc.org, we still supported OS X 10.7 but they had already dropped support. This forced us to restore some older code, e.g. at [1]. If switching the minimum version of OS X is causing problems, there is likely a problem with the newer code. I filed Bug 1409018 to remove the ifdeffed code. I'll take a look at getting that fixed up.

The originally reported problem seemed to only happen on Windows 10, so it sounds like there are at least two separate things going on here.

[1] https://searchfox.org/mozilla-central/source/media/webrtc/trunk/webrtc/modules/desktop_capture/screen_capturer_mac.mm#323
Flags: needinfo?(dminor)
See Also: → 1409018
Test got reenabled in bug 1409018.
Summary: [test disabled] Intermittent dom/media/tests/mochitest/test_getUserMedia_basicScreenshare.html | Test timed out. → Intermittent dom/media/tests/mochitest/test_getUserMedia_basicScreenshare.html | Test timed out.
This has started becoming frequent on windows recently. Screenshots show that it's because of a windows update modal. Not sure who should look at this. jlund, can you help out again?
Flags: needinfo?(jlund)
(In reply to Andreas Pehrson [:pehrsons] from comment #89)
> Created attachment 9027834 [details]
> windows-needs-updates_2018-11-27
> 
> This has started becoming frequent on windows recently. Screenshots show
> that it's because of a windows update modal. Not sure who should look at
> this. jlund, can you help out again?

grenade, markco, I see from comment 88 that we are hitting these with Windows cloud-based instances, e.g. gecko-t-win10-64.

Any ideas here?
Flags: needinfo?(rthijssen)
Flags: needinfo?(mcornmesser)
Flags: needinfo?(jlund)
we've put in patches for the windows update modal as part of bug 1510220, and for a related problem in bug 1481628.

are we still seeing this issue since those patches?

for clarity,
- the bug 1510220 fix was implemented on gecko-t-win10-64 and gecko-t-win10-64-gpu yesterday in this patch: https://github.com/mozilla-releng/OpenCloudConfig/commit/c421eb8
- the same patch was applied to gecko-t-win10-64-hw and gecko-t-win10-64-ux (hardware) today in this patch: https://github.com/mozilla-releng/OpenCloudConfig/commit/3c874a7
- the bug 1481628 fix was applied to all win 10 workers (ec2 & hw) today in this patch: https://github.com/mozilla-releng/OpenCloudConfig/commit/3c874a7
Flags: needinfo?(rthijssen)
Flags: needinfo?(mcornmesser)

Since bug 1558482 landed here: https://tinyurl.com/y2glf2yg, this bug has had a very high failure rate: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-06-06&endday=2019-06-13&tree=all&bug=1405083

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=251680588&repo=mozilla-inbound&lineNumber=13760

[task 2019-06-13T16:16:18.486Z] 16:16:18 INFO - TEST-UNEXPECTED-FAIL | dom/media/tests/mochitest/test_getUserMedia_basicScreenshare.html | Test timed out.
[task 2019-06-13T16:16:18.487Z] 16:16:18 INFO - SimpleTest.ok@https://example.com/tests/SimpleTest/SimpleTest.js:275:18
[task 2019-06-13T16:16:18.487Z] 16:16:18 INFO - reportError@https://example.com/tests/SimpleTest/TestRunner.js:121:22
[task 2019-06-13T16:16:18.488Z] 16:16:18 INFO - TestRunner._checkForHangs@https://example.com/tests/SimpleTest/TestRunner.js:142:7
[task 2019-06-13T16:16:18.489Z] 16:16:18 INFO - setTimeout handlerTestRunner._checkForHangs@https://example.com/tests/SimpleTest/TestRunner.js:163:5
[task 2019-06-13T16:16:18.491Z] 16:16:18 INFO - setTimeout handlerTestRunner._checkForHangs@https://example.com/tests/SimpleTest/TestRunner.js:163:5
[task 2019-06-13T16:16:18.492Z] 16:16:18 INFO - setTimeout handlerTestRunner._checkForHangs@https://example.com/tests/SimpleTest/TestRunner.js:163:5
[task 2019-06-13T16:16:18.492Z] 16:16:18 INFO - setTimeout handlerTestRunner._checkForHangs@https://example.com/tests/SimpleTest/TestRunner.js:163:5
[task 2019-06-13T16:16:18.493Z] 16:16:18 INFO - setTimeout handlerTestRunner._checkForHangs@https://example.com/tests/SimpleTest/TestRunner.js:163:5
[task 2019-06-13T16:16:18.494Z] 16:16:18 INFO - setTimeout handlerTestRunner._checkForHangs@https://example.com/tests/SimpleTest/TestRunner.js:163:5
[task 2019-06-13T16:16:18.494Z] 16:16:18 INFO - setTimeout handlerTestRunner._checkForHangs@https://example.com/tests/SimpleTest/TestRunner.js:163:5
[task 2019-06-13T16:16:18.495Z] 16:16:18 INFO - setTimeout handlerTestRunner._checkForHangs@https://example.com/tests/SimpleTest/TestRunner.js:163:5
[task 2019-06-13T16:16:18.495Z] 16:16:18 INFO - setTimeout handlerTestRunner._checkForHangs@https://example.com/tests/SimpleTest/TestRunner.js:163:5
[task 2019-06-13T16:16:18.497Z] 16:16:18 INFO - setTimeout handlerTestRunner._checkForHangs@https://example.com/tests/SimpleTest/TestRunner.js:163:5
[task 2019-06-13T16:16:18.497Z] 16:16:18 INFO - setTimeout handlerTestRunner._checkForHangs@https://example.com/tests/SimpleTest/TestRunner.js:163:5
[task 2019-06-13T16:16:18.498Z] 16:16:18 INFO - setTimeout handlerTestRunner._checkForHangs@https://example.com/tests/SimpleTest/TestRunner.js:163:5
[task 2019-06-13T16:16:18.498Z] 16:16:18 INFO - setTimeout handlerTestRunner._checkForHangs@https://example.com/tests/SimpleTest/TestRunner.js:163:5
[task 2019-06-13T16:16:18.498Z] 16:16:18 INFO - setTimeout handlerTestRunner._checkForHangs@https://example.com/tests/SimpleTest/TestRunner.js:163:5
[task 2019-06-13T16:16:18.498Z] 16:16:18 INFO - setTimeout handlerTestRunner._checkForHangs@https://example.com/tests/SimpleTest/TestRunner.js:163:5
[task 2019-06-13T16:16:18.499Z] 16:16:18 INFO - TestRunner.runTests/<@https://example.com/tests/SimpleTest/TestRunner.js:381:20
[task 2019-06-13T16:16:18.500Z] 16:16:18 INFO - promise callbackTestRunner.runTests@https://example.com/tests/SimpleTest/TestRunner.js:368:50
[task 2019-06-13T16:16:18.501Z] 16:16:18 INFO - RunSet.runtests@https://example.com/tests/SimpleTest/setup.js:201:14
[task 2019-06-13T16:16:18.502Z] 16:16:18 INFO - RunSet.runall@https://example.com/tests/SimpleTest/setup.js:180:12
[task 2019-06-13T16:16:18.502Z] 16:16:18 INFO - hookupTests@https://example.com/tests/SimpleTest/setup.js:273:12
[task 2019-06-13T16:16:18.503Z] 16:16:18 INFO - parseTestManifest@https://example.com/manifestLibrary.js:36:5
[task 2019-06-13T16:16:18.503Z] 16:16:18 INFO - getTestManifest/req.onload@https://example.com/manifestLibrary.js:49:11
[task 2019-06-13T16:16:18.504Z] 16:16:18 INFO - EventHandlerNonNullgetTestManifest@https://example.com/manifestLibrary.js:45:3
[task 2019-06-13T16:16:18.505Z] 16:16:18 INFO - hookup@https://example.com/tests/SimpleTest/setup.js:253:5
[task 2019-06-13T16:16:18.506Z] 16:16:18 INFO - EventHandlerNonNull@https://example.com/tests?autorun=1&closeWhenDone=1&consoleLevel=INFO&manifestFile=tests.json&dumpOutputDirectory=%2Ftmp&cleanupCrashes=true:11:1
[task 2019-06-13T16:16:18.507Z] 16:16:18 INFO - Checking pixel against grey. Got [148,148,148,255] (NO)
[task 2019-06-13T16:16:18.508Z] 16:16:18 INFO - Checking pixel against grey. Got [148,148,148,255] (NO)
[task 2019-06-13T16:16:18.727Z] 16:16:18 INFO - Checking pixel against grey. Got [148,148,148,255] (NO)
[task 2019-06-13T16:16:18.977Z] 16:16:18 INFO - Checking pixel against grey. Got [148,148,148,255] (NO)
[task 2019-06-13T16:16:19.025Z] 16:16:19 INFO - GECKO(2565) | MEMORY STAT | vsize 20974980MB | residentFast 1056MB
[task 2019-06-13T16:16:19.087Z] 16:16:19 INFO - TEST-OK | dom/media/tests/mochitest/test_getUserMedia_basicScreenshare.html | took 310800ms

Ryan, could you please take a look at this?

Flags: needinfo?(rhunt)
Regressed by: 1558482

Looks like bug 1558482 broke fullscreen somehow; here's a screenshot from one of the recent failures. This is a canvas element in fullscreen. The white part and the text on the right are both unexpected (and the reason the test is failing).
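
For reference, the test's setup is roughly: fill a canvas with a known color, put it into fullscreen, then screenshare and verify the captured pixels match that color. A sketch of that kind of setup (illustrative only, not the actual mochitest code):

```js
// Illustrative setup only -- not the actual test code.
async function showFullscreenColor(color) {
  const canvas = document.createElement("canvas");
  canvas.width = window.screen.width;
  canvas.height = window.screen.height;
  const ctx = canvas.getContext("2d");
  ctx.fillStyle = color; // e.g. "red" or "green"
  ctx.fillRect(0, 0, canvas.width, canvas.height);
  document.body.appendChild(canvas);
  // Fullscreen normally requires a user gesture; the mochitest harness
  // relaxes this via test-only prefs/permissions.
  await canvas.requestFullscreen();
  return canvas;
}
```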

Pushed by apavel@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/98b09c54784f
disabled test_getUserMedia_basicScreenshare.html on linux64 r=aryx
Whiteboard: [stockwell disable-recommended] → [stockwell disabled]
Pushed by apavel@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4a9c8e21ef23
add missing bracket on a CLOSED TREE

(In reply to Andreas Pehrson [:pehrsons] from comment #121)
> Created attachment 9072109 [details]
> Bug 1558482 fullscreen canvas regression, 2019-06-14
> 
> Looks like bug 1558482 broke fullscreen somehow; here's a screenshot from
> one of the recent failures. This is a canvas element in fullscreen. The
> white part and the text on the right are both unexpected (and the reason the
> test is failing).

That sounds like a duplicate of bug 1559518 comment 4. I have a patch on that bug which should fix this hopefully.

Flags: needinfo?(rhunt)

(In reply to Ryan Hunt [:rhunt] from comment #129)
> That sounds like a duplicate of bug 1559518 comment 4. I have a patch on
> that bug which should fix this hopefully.

That one's fixed now. I suppose we can re-enable this test then.

Pushed by pehrsons@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/6db28048fd8a
Re-enable test_getUserMedia_basicScreenshare.html on linux64. r=rhunt

We're still seeing failures on OS 10.14, the screenshot seems to be grey:

https://taskcluster-artifacts.net/VpU4EwqnSc-hsIZVlJZ-_g/0/public/test_info/mozilla-test-fail-screenshot_HEedZn.png

Not sure if this regressed along with bug 1558482. It seems that the failures on OS 10.14 started on June 14, a day after the failures started on linux, but that may just be because the rate on OS 10.14 is much lower than it was on linux.

(In reply to Byron Campen [:bwc] from comment #139)
> We're still seeing failures on OS 10.14, the screenshot seems to be grey:
> 
> https://taskcluster-artifacts.net/VpU4EwqnSc-hsIZVlJZ-_g/0/public/test_info/mozilla-test-fail-screenshot_HEedZn.png
> 
> Not sure if this regressed along with bug 1558482. It seems that the
> failures on OS 10.14 started on June 14, a day after the failures started on
> linux, but that may just be because the rate on OS 10.14 is much lower than
> it was on linux.

Another issue is this OS modal from a 10.14 machine.

IMO both of these cases indicate some other problem, since bug 1558482 was fixed but the grey screens on 10.14 continue.

Edwin, can you take a look? (Not sure if you're the right person for this, but I suspect you know who to bring it to)

Remembering to set the needinfo is good.

Flags: needinfo?(egao)

:pehrsons - comment 141 could be down to the way the macosx1014 machines are configured - I'm aware of at least one other failure caused by some OS-level UI notification on macosx1014.

Comment 137 I am not as sure about - got no leads.

Pinging :dividehex and :dhouse - does the modal from comment 141 look like an OS configuration issue? How about comment 137, is that a test issue?

Flags: needinfo?(jwatkins)
Flags: needinfo?(egao)
Flags: needinfo?(dhouse)

(In reply to Edwin Gao (:egao) from comment #143)
> :pehrsons - comment 141 could be down to the way the macosx1014 machines are
> configured - I'm aware of at least one other failure caused by some OS-level
> UI notification on macosx1014.
> 
> Comment 137 I am not as sure about - got no leads.
> 
> Pinging :dividehex and :dhouse - does the modal from comment 141 look like
> an OS configuration issue? How about comment 137, is that a test issue?

Yes, that looks like wifi turned on and connected to the datacenter's wifi automatically. I'll talk with :dividehex about it and look for how to prevent this.

Depends on: 1570375

(In reply to Dave House [:dhouse] from comment #144)
> (In reply to Edwin Gao (:egao) from comment #143)
> > :pehrsons - comment 141 could be down to the way the macosx1014 machines
> > are configured - I'm aware of at least one other failure caused by some
> > OS-level UI notification on macosx1014.
> > 
> > Comment 137 I am not as sure about - got no leads.
> > 
> > Pinging :dividehex and :dhouse - does the modal from comment 141 look like
> > an OS configuration issue? How about comment 137, is that a test issue?
> 
> Yes, that looks like wifi turned on and connected to the datacenter's wifi
> automatically. I'll talk with :dividehex about it and look for how to
> prevent this.

I've turned off wifi on the mojave workers.

Flags: needinfo?(jwatkins)
Flags: needinfo?(dhouse)