Intermittent dom/media/webrtc/tests/mochitests/test_getUserMedia_basicScreenshare.html | Error executing test: Error: Checking 72.75,43.75 against <colour> timed out. Got [210,213,203,255].
Categories
(Core :: WebRTC: Audio/Video, defect, P2)
Tracking
()
People
(Reporter: intermittent-bug-filer, Unassigned)
References
(Blocks 2 open bugs)
Details
(Keywords: intermittent-failure, Whiteboard: ysod, [stockwell disabled])
Attachments
(3 files)
Filed by: csabou [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=335585527&repo=mozilla-central
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/KI0zntPuQ86Mxv-MgiFujg/runs/0/artifacts/public/logs/live_backing.log
[task 2021-04-06T17:11:30.492Z] 17:11:30 INFO - TEST-FAIL | dom/media/webrtc/tests/mochitests/test_getUserMedia_basicScreenshare.html | The author of the test has indicated that flaky timeouts are expected. Reason: WebRTC inherently depends on timeouts
[task 2021-04-06T17:11:30.493Z] 17:11:30 INFO - Buffered messages logged at 17:11:29
[task 2021-04-06T17:11:30.493Z] 17:11:30 INFO - TEST-FAIL | dom/media/webrtc/tests/mochitests/test_getUserMedia_basicScreenshare.html | The author of the test has indicated that flaky timeouts are expected. Reason: WebRTC inherently depends on timeouts
[task 2021-04-06T17:11:30.494Z] 17:11:30 INFO - TEST-FAIL | dom/media/webrtc/tests/mochitests/test_getUserMedia_basicScreenshare.html | The author of the test has indicated that flaky timeouts are expected. Reason: WebRTC inherently depends on timeouts
[task 2021-04-06T17:11:30.496Z] 17:11:30 INFO - Buffered messages finished
[task 2021-04-06T17:11:30.496Z] 17:11:30 INFO - TEST-UNEXPECTED-FAIL | dom/media/webrtc/tests/mochitests/test_getUserMedia_basicScreenshare.html | Error executing test: Error: Checking 72.75,43.75 against grey timed out. Got [210,213,203,255].
[task 2021-04-06T17:11:30.497Z] 17:11:30 INFO - verifyAround/<.cancel<@https://example.com/tests/dom/media/webrtc/tests/mochitests/test_getUserMedia_basicScreenshare.html:49:13
[task 2021-04-06T17:11:30.498Z] 17:11:30 INFO -
[task 2021-04-06T17:11:30.498Z] 17:11:30 INFO - SimpleTest.ok@https://example.com/tests/SimpleTest/SimpleTest.js:417:16
[task 2021-04-06T17:11:30.502Z] 17:11:30 INFO - runTestWhenReady@https://example.com/tests/dom/media/webrtc/tests/mochitests/head.js:476:7
[task 2021-04-06T17:11:30.502Z] 17:11:30 INFO - GECKO(5515) | MEMORY STAT | vsize 2752MB | residentFast 162MB | heapAllocated 16MB
[task 2021-04-06T17:11:30.503Z] 17:11:30 INFO - TEST-OK | dom/media/webrtc/tests/mochitests/test_getUserMedia_basicScreenshare.html | took 40424ms
Comment 11•4 years ago
So it looks like there's an error popup obscuring the red section that we're trying to capture in the failure cases:
I can't make out the text on the error popup though. I guess we could change the location that we sample to avoid this, but it would be better to determine why that popup is there and fix that bug.
Comment 12•4 years ago
Upping priority since it is so common now. Root cause (the error popup) might be in another component, but hard to tell for sure right now.
Comment 13•4 years ago
Comment 14•4 years ago
Information on the see-also I just added: https://bugzilla.mozilla.org/show_bug.cgi?id=1714410#c51
Comment 15•4 years ago
Comment 16•4 years ago
Seems like the stack involves the handlers for OnStopRequest, just like the stack from bug 1714410:
https://treeherder.mozilla.org/logviewer?job_id=343892820&repo=try&lineNumber=8790
Comment 19•4 years ago
Comment 20•4 years ago
So I can confirm that OnStopRequest is not passing a failing error code in |status| in this case.
Comment 21•4 years ago
Comment 22•4 years ago
Comment 23•4 years ago
Comment 24•4 years ago
So it seems that at the level of nsInputStreamPump, OnStopRequest does have a failure set in |status|. Maybe it is getting lost somewhere along the way?
Comment 25•4 years ago
Comment 26•4 years ago
Comment 28•4 years ago
(In reply to Byron Campen [:bwc] from comment #20)
So I can confirm that OnStopRequest is not passing a failing error code in |status| in this case.
Ok, I was wrong about this, because apparently NS_WARNING is not outputting anything to the console (why?). Adding a MOZ_CRASH for this case (see comment 26) shows that |status| is NS_FAILED in nsParser::OnStopRequest, but we are resuming the parse anyway. This is probably not the right thing to be doing, right?
Comment 29•4 years ago
(In reply to Byron Campen [:bwc] from comment #28)
(In reply to Byron Campen [:bwc] from comment #20)
So I can confirm that OnStopRequest is not passing a failing error code in |status| in this case.
Ok, I was wrong about this, because apparently NS_WARNING is not outputting anything to the console (why?).
NS_WARNING is a no-op in release builds (https://searchfox.org/mozilla-central/rev/5227b2bd674d49c0eba365a709d3fb341534f361/xpcom/base/nsDebug.h#130-137), so in opt runs it won't produce any output.
Adding a MOZ_CRASH for this case (see comment 26) shows that |status| is NS_FAILED in nsParser::OnStopRequest, but we are resuming the parse anyway. This is probably not the right thing to be doing, right?
I don't know if that's necessarily wrong. I'm not super familiar with the HTML parser, so ni? :hsivonen as the HTML-parser-expert :-)
Comment 30•4 years ago
(In reply to Nika Layzell [:nika] (ni? for response) from comment #29)
(In reply to Byron Campen [:bwc] from comment #28)
(In reply to Byron Campen [:bwc] from comment #20)
So I can confirm that OnStopRequest is not passing a failing error code in |status| in this case.
Ok, I was wrong about this, because apparently NS_WARNING is not outputting anything to the console (why?).
NS_WARNING is a no-op in release builds (https://searchfox.org/mozilla-central/rev/5227b2bd674d49c0eba365a709d3fb341534f361/xpcom/base/nsDebug.h#130-137), so in opt runs it won't produce any output.
Adding a MOZ_CRASH for this case (see comment 26) shows that |status| is NS_FAILED in nsParser::OnStopRequest, but we are resuming the parse anyway. This is probably not the right thing to be doing, right?
I don't know if that's necessarily wrong. I'm not super familiar with the HTML parser, so ni? :hsivonen as the HTML-parser-expert :-)
Note that this isn't the HTML parser. This is the XML / about:blank parser.
The NS_SUCCEEDED(rv) bit looks like a bug, because rv has only ever been initialized to NS_OK earlier. However, when the NS_SUCCEEDED(rv) bit was added, it made sense.
It's not obviously bogus to consume the data that we already got even if we get notified of the stream ending in error.
peterv, do you have more XML parsing insight for this?
Comment 31•4 years ago
Comment 32•4 years ago
Checking |status| before resuming the parse seems to fix the bug. Running other tests.
Comment 33•4 years ago
Comment 34•4 years ago
So, it seems that this patch introduces a new "YOU ARE LEAKING THE WORLD" failure, this time on windows10-32-qr debug. The quick-and-dirty fix from bug 1714410 does not prevent this, so it seems there's another leak bug we're going to need to chase here.
Comment 35•4 years ago
Comment 36•4 years ago
I think I figured out the issue in comment 34. Here are some webrtc-specific try pushes.
With fix:
https://treeherder.mozilla.org/jobs?repo=try&revision=f49ecd3f565b6c90f09304d04c417be0db6cbf65
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ff84b6cbd984e71ba81655daa0a185363b8a0d14
Baseline:
https://treeherder.mozilla.org/jobs?repo=try&revision=cd25492c4151e65e3cb5a5751ee0eecf025800e8
https://treeherder.mozilla.org/#/jobs?repo=try&revision=632a42cf3b10f382deb4e5d5c4598d1ece1a4fa1
Since this is a modification that affects anything XUL-related, I'll need to run much more testing than this, but I am having difficulty getting try to run a more thorough set of tests. I am not sure what the problem is here.
https://treeherder.mozilla.org/jobs?repo=try&revision=387990cdd7cd7fc6f0beac0bad443606d6417210
Comment 37•4 years ago
Broader try pushes (mochitest and xpcshell). Not sure what else we need to cover this stuff...
With fix:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=eb7e449a9724ca8240937743a3879e9fd388fc2c
https://treeherder.mozilla.org/#/jobs?repo=try&revision=db97a6b0a2f7a8193db27537e2764041f3c7db55
Baseline:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=63acda7d6079cc0bfe16a296b7308791ee81b081
Comment 38•4 years ago
Bug seems to be fixed in try pushes in comment 36, without apparent regressions or increases in rate of existing intermittents in webrtc mochitests. There is only one failure (pre-existing intermittent) present in the push for webrtc wpt, running retriggers to see if the rate has changed.
Comment 39•4 years ago
Broader mochitest push (comment 37) seems to have a similar profile of oranges to baseline, but will require further analysis. xpcshell push (also comment 37) seems to look different, retriggering to see if we just got unlucky.
Comment 40•4 years ago
Comment 41•4 years ago
Comment 42•4 years ago
Let's see what happens on try when we add a MOZ_CRASH on ysod.
Baseline mochitests: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e83eea3f85cc9d6e085730526a5431fd865ab153
Mochitests with fix: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ba6b4babbc7f7f9c2ad0b27553babbc21b67c3e8
Edit: Seeing lots of cases where we hit that MOZ_CRASH, and in many of those cases (particularly on linux), that crash seems to be replacing "YOU ARE LEAKING THE WORLD" failures. This hints that both are being caused by the same thing, similar to how this bug and bug 1714410 are caused by the same thing (inappropriate handling of OnStopRequest with an error code). It does not appear that we can leave this MOZ_CRASH (or a MOZ_ASSERT) in here, since that breaks parser/htmlparser/tests/mochitest/browser_ysod_telemetry.js (which deliberately triggers a ysod). Maybe there's something we could do to that test-case to tell nsExpatDriver that a ysod is expected?
Comment 43•4 years ago
What do you think about the idea of putting a MOZ_ASSERT in the ysod case, and finding a way to disable that assertion when we're deliberately triggering a ysod (eg; parser/htmlparser/tests/mochitest/browser_ysod_telemetry.js)?
Comment 45•4 years ago
Ok, after loads of retriggers on some very unlucky jobs, the oranges we have look similar to baseline. Can't rule out all cases of making an existing orange worse, or introducing a new intermittent somewhere, since that would require running all the tests in the tree loads of times.
Comment 46•4 years ago
Comment 47•4 years ago
Depends on D120107
Comment 48•4 years ago
Comment 50•4 years ago
(In reply to Byron Campen [:bwc] from comment #28)
(In reply to Byron Campen [:bwc] from comment #20)
So I can confirm that OnStopRequest is not passing a failing error code in |status| in this case.
Ok, I was wrong about this, because apparently NS_WARNING is not outputting anything to the console (why?). Adding a MOZ_CRASH for this case (see comment 26) shows that |status| is NS_FAILED in nsParser::OnStopRequest, but we are resuming the parse anyway. This is probably not the right thing to be doing, right?
What is causing the network failure? Isn't this just wallpaper over that error?
Comment 51•4 years ago
(In reply to Peter Van der Beken [:peterv] from comment #50)
(In reply to Byron Campen [:bwc] from comment #28)
(In reply to Byron Campen [:bwc] from comment #20)
So I can confirm that OnStopRequest is not passing a failing error code in |status| in this case.
Ok, I was wrong about this, because apparently NS_WARNING is not outputting anything to the console (why?). Adding a MOZ_CRASH for this case (see comment 26) shows that |status| is NS_FAILED in nsParser::OnStopRequest, but we are resuming the parse anyway. This is probably not the right thing to be doing, right?
What is causing the network failure? Isn't this just wallpaper over that error?
In the case of dom/media/webrtc/tests/mochitests/test_1488832.html, this seems to be an aborted load; the test case repeatedly performs a getUserMedia and reloads the page, which prompts us to try to load and display the webrtc sharing indicator each time. Sometimes, on some platforms, the page reload happens before we finish loading/parsing the indicator.
I do not know the reason this happens in other test cases (eg; browser/extensions/formautofill/test/browser/browser_first_time_use_doorhanger.js, browser/components/uitour/test/browser_openPreferences.js, browser/base/content/test/general/browser_datachoices_notification.js), but I suspect this is a similar story.
I suppose I could figure out which nsresult corresponds to the aborted load, and only avoid resuming the parse for that error?
Comment 52•4 years ago
Even an aborted load can lead to a very broken UI. The problem is not showing the YSOD, displaying a broken UI is just as bad. At least with a YSOD there's an indication that something is very wrong and the UI is broken.
If the page is reloaded, why are we still showing the indicator for the previous load? Also, why is a UI load aborted for a reload of web content?
Comment 53•4 years ago
(In reply to Peter Van der Beken [:peterv] from comment #52)
Even an aborted load can lead to a very broken UI. The problem is not showing the YSOD, displaying a broken UI is just as bad. At least with a YSOD there's an indication that something is very wrong and the UI is broken.
If the page is reloaded, why are we still showing the indicator for the previous load? Also, why is a UI load aborted for a reload of web content?
Nothing is telling the parser "Never mind, we decided we aren't going to load/display this after all" besides the OnStopRequest callback, I think. The webrtc indicator is intended to be displayed when there's an active capture (camera, mic, or screen/window sharing), so if the (content) document that is performing that capture goes away, the (chrome) indicator needs to go away also. So we are starting a capture, which begins the process of loading/displaying the indicator, then we reload the page, which stops the capture, which in turn causes the indicator to be unloaded, and sometimes this all occurs before we finish loading the indicator. (See this.)
If there is some existing mechanism to tell the parser "Never mind, we don't need this after all." before OnStopRequest, let's use that of course, but I'm not aware of what mechanism that might be.
Comment 54•4 years ago
FWIW, it seems that in practice we only see NS_BINDING_ABORTED as a failure code here.
https://treeherder.mozilla.org/jobs?repo=try&revision=230b5c75927454f1e156e4e7611b64094681087b
Comment 55•4 years ago
(In reply to Byron Campen [:bwc] from comment #53)
Nothing is telling the parser "Never mind, we decided we aren't going to load/display this after all" besides the OnStopRequest callback, I think.
OnStopRequest just means that no additional data is going to be passed to the consumer. In the XML parser's case that means we finish parsing and produce an error if the document is malformed. It doesn't signal anything about whether we're going to display the document or not, and in fact according to comment 11 we are actually displaying this document. It seems to me that the real issue is that we should stop displaying the document for the indicator when the capture stopped, and having a YSOD instead of a broken document doesn't affect that. What is preventing the window from being closed?
Also, just to make sure I understand what's going on: we seem to be relying on code running in the indicator document (onunload, DOMWindowClose, ...) but we're closing the window for it before it has completely loaded. Isn't it problematic that that code might not run (for all we know we the script might not even have loaded).
Comment 56•4 years ago
(In reply to Peter Van der Beken [:peterv] from comment #55)
(In reply to Byron Campen [:bwc] from comment #53)
Nothing is telling the parser "Never mind, we decided we aren't going to load/display this after all" besides the OnStopRequest callback, I think.
OnStopRequest just means that no additional data is going to be passed to the consumer. In the XML parser's case that means we finish parsing and produce an error if the document is malformed. It doesn't signal anything about whether we're going to display the document or not, and in fact according to comment 11 we are actually displaying this document. It seems to me that the real issue is that we should stop displaying the document for the indicator when the capture stopped, and having a YSOD instead of a broken document doesn't affect that. What is preventing the window from being closed?
That window is the ysod, not the indicator. The indicator is never loaded or displayed, I think. Something is clearly telling the necko code "never mind" (because it is firing an OnStopRequest with NS_BINDING_ABORTED). The parser obviously needs to know this too, the question is whether it learns it second-hand from necko through OnStopRequest, or some other means. I've implemented the former, but am willing to implement the latter if someone gives me a pointer to where that ought to happen.
Also, just to make sure I understand what's going on: we seem to be relying on code running in the indicator document (onunload, DOMWindowClose, ...) but we're closing the window for it before it has completely loaded. Isn't it problematic that that code might not run (for all we know we the script might not even have loaded).
Since this is simply an indicator that is intended to be displayed only when there is an active capture, it is not problematic that it is never displayed. Unloading a not-yet-loaded-and-parsed chrome element is a fairly common occurrence in our CI (see comment 51 and comment 42), and does not strike me as a should-never-happen kind of event (if it is a should-never-happen event, then let's fix the code that allows an onunload to fire before the doc is loaded). If there is some other callback than OnStopRequest that ought to be used to tell the parser "Whoops, never mind, we don't need to display this after all, so don't bother parsing it.", I would be happy to use that instead.
Comment 57•4 years ago
Comment 58•4 years ago
If you are seeing a YSOD in this case, then we're already trying to display the indicator. We've opened its window and we're loading the indicator document in it. If at that point we don't expect any cleanup code to run from within the indicator document itself, then there's no issue. I don't know anything about the indicator, but from a quick skim of the code I wasn't sure that that's the case.
I think you're focusing on the wrong fix for this issue. Displaying a YSOD is a symptom of a combination of two things: 1) displaying a window and 2) loading a broken document into that window. In the case of XML documents we've always opted to signal a broken document by replacing it with the YSOD content. The broken document can have different causes, for example aborting the network load.
In this case it looks like we get a NS_BINDING_ABORTED. I don't think you've explained why we're getting that, it seems to me like we would at least want to know that. But let's say that that NS_BINDING_ABORTED is benign, somehow a result of not wanting to display the window anymore. The parser actually doesn't need to know anything about that. Whatever is deciding that we don't want to display that window anymore is responsible for hiding or closing it. Making the parser not emit a YSOD doesn't change that, the window will either contain the YSOD or the broken document, and making it display the broken document doesn't make the window disappear.
Comment 61•4 years ago
(In reply to Peter Van der Beken [:peterv] from comment #58)
If you are seeing a YSOD in this case, then we're already trying to display the indicator. We've opened its window and we're loading the indicator document in it. If at that point we don't expect any cleanup code to run from within the indicator document itself, then there's no issue. I don't know anything about the indicator, but from a quick skim of the code I wasn't sure that that's the case.
I think you're focusing on the wrong fix for this issue. Displaying a YSOD is a symptom of a combination of two things: 1) displaying a window and 2) loading a broken document into that window. In the case of XML documents we've always opted to signal a broken document by replacing it with the YSOD content. The broken document can have different causes, for example aborting the network load.
In this case it looks like we get a NS_BINDING_ABORTED. I don't think you've explained why we're getting that, it seems to me like we would at least want to know that. But let's say that that NS_BINDING_ABORTED is benign, somehow a result of not wanting to display the window anymore. The parser actually doesn't need to know anything about that. Whatever is deciding that we don't want to display that window anymore is responsible for hiding or closing it. Making the parser not emit a YSOD doesn't change that, the window will either contain the YSOD or the broken document, and making it display the broken document doesn't make the window disappear.
So you're saying the ysod is the unsuccessfully loaded indicator, and the fact that the ysod is still around long after the test that tried to display the indicator means that the code that hides/destroys the indicator is not being successfully run?
Comment 62•4 years ago
Comment 63•4 years ago
Hmm. So looking at this some more, in the cases where we hit the ysod (in dom/media/webrtc/tests/mochitests/test_1488832.html), I'm not seeing any output from this change, whereas I do see lots of output in the cases where there is not a failure. It seems that we are not getting as far as loading the JS code when we fail, at least not to init anyway.
Comment 64•4 years ago
(In reply to Byron Campen [:bwc] from comment #61)
So you're saying the ysod is the unsuccessfully loaded indicator, and the fact that the ysod is still around long after the test that tried to display the indicator means that the code that hides/destroys the indicator is not being successfully run?
I believe that's the case, yes.
(In reply to Byron Campen [:bwc] from comment #63)
Hmm. So looking at this some more, in the cases where we hit the ysod (in dom/media/webrtc/tests/mochitests/test_1488832.html), I'm not seeing any output from this change, whereas I do see lots of output in the cases where there is not a failure. It seems that we are not getting as far as loading the JS code when we fail, at least not to init anyway.
Well, if the document load is aborted before we even parse the script tag then the JS code wouldn't run. But why are we getting a NS_BINDING_ABORTED, what is causing the load to be aborted?
Comment 65•4 years ago
(In reply to Peter Van der Beken [:peterv] from comment #64)
(In reply to Byron Campen [:bwc] from comment #61)
So you're saying the ysod is the unsuccessfully loaded indicator, and the fact that the ysod is still around long after the test that tried to display the indicator means that the code that hides/destroys the indicator is not being successfully run?
I believe that's the case, yes.
(In reply to Byron Campen [:bwc] from comment #63)
Hmm. So looking at this some more, in the cases where we hit the ysod (in dom/media/webrtc/tests/mochitests/test_1488832.html), I'm not seeing any output from this change, whereas I do see lots of output in the cases where there is not a failure. It seems that we are not getting as far as loading the JS code when we fail, at least not to init anyway.
Well, if the document load is aborted before we even parse the script tag then the JS code wouldn't run. But why are we getting a NS_BINDING_ABORTED, what is causing the load to be aborted?
Because the gUM capture is done, and therefore we are closing the window for the indicator. If the window is not actually being closed, I do not know why, because webrtcUI.jsm has called close() on it and relinquished its reference:
https://treeherder.mozilla.org/logviewer?job_id=347555389&repo=try&lineNumber=9186
https://hg.mozilla.org/try/rev/95b33522d117d0824863e390d7e529dc951660df
Someone with more knowledge of the lifecycle of chrome windows probably needs to investigate this; this does not seem to be an issue that is specific to webrtc.
Comment 67•4 years ago
Comment 69•4 years ago
Comment 70•4 years ago
bugherder
Comment 74•3 years ago
Recent failures here point to Bug 1767445, retriggers here.
Comment 75•3 years ago
I'll take a look, but not sure where to begin. Maybe help with reproducing it locally so I can debug, or help understanding the error message.
Some context: the changes in Bug 1767445 are behind a pref that is off by default, the pref is being added in this patch, and it only makes changes in newtab. From what I can tell that test is not running anything on newtab.
So either something is wrong with the pref or the mechanism that flips it, and that somehow triggers an error in this test; or the pref mechanism isn't complete enough and some changes that are supposed to be off are somehow on; or some code that checks the state of the pref is incorrectly causing something to change; or this could be a false positive. All seem pretty unlikely looking at both my changes and the test. Either way I'll find some time to look more.
Comment 77•3 years ago
Sorry Scott, unfortunately I can't help you debug this. Maybe jib or gvn can help you with that. What I can say is that it fails only on windows10-2004-qr, 32/64-bit, both opt and debug, with fission enabled: https://treeherder.mozilla.org/intermittent-failures/bugdetails?startday=2022-05-17&endday=2022-05-24&tree=trunk&bug=1703346.
And the test is skipped on:
[test_getUserMedia_basicScreenshare.html]
skip-if =
toolkit == 'android' # no screenshare on android
apple_silicon # bug 1707742
apple_catalina # platform migration
Comment 79•3 years ago
Looking at the screenshot for the error: https://firefoxci.taskcluster-artifacts.net/B1AfXbg2RAOuGRZ2OWUtLQ/0/public/test_info/mozilla-test-fail-screenshot_x3_vrhda.png
There is a big "Your Windows license will expire soon" dialog right over the section of the screen that the test is checking for colours; could this be what's causing the failure?
If so, the question to me is how my patch causes the Windows license dialog to show.
Comment 80•3 years ago
(In reply to Scott [:thecount] Downe from comment #79)
There is a big "You Windows license will expire soon" right over the section of the test that is checking for colours, could this be what's causing the failure?
Yes, this is a screen capture test, so that is what's causing the failure. Specifically, the dimming of the page that appears behind the license warning is what's causing the failure since the test picks up faded red and faded gray instead of bright red and gray in the two tests I looked at.
If so, the question to me is how much patch causes the Windows license dialog to show.
This is not caused by any patch but by expired Windows licenses on some of our test machines. This happens from time to time, and test_getUserMedia_basicScreenshare.html is just the unlucky test that catches it.
Comment 81•3 years ago
Great, so sounds like it's highly unlikely Bug 1767445 caused this, and we just had some false positives that pointed to my bug?
I do not need to investigate any further?
Comment 82•3 years ago
Looking at the failure rate here it has stopped failing for now. https://treeherder.mozilla.org/intermittent-failures/bugdetails?startday=2022-05-27&endday=2022-06-03&tree=trunk&bug=1703346
Comment 87•3 years ago
Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit auto_nag documentation.