Intermittent test_navigation.py TestPageLoadStrategy.test_eager | AssertionError: 'interactive' != u'complete'

RESOLVED WONTFIX

Status

defect
P3
normal
RESOLVED WONTFIX
2 years ago
4 months ago

People

(Reporter: aryx, Unassigned)

Tracking

(Depends on 1 bug, {intermittent-failure, mobile})

Version 3
Unspecified
Android
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox56 disabled, firefox57 disabled)

Details

(Whiteboard: [stockwell disabled][android disabled])

Attachments

(1 attachment)

+++ This bug was initially created as a clone of Bug #1364245 +++

https://treeherder.mozilla.org/logviewer.html#?job_id=102520612&repo=autoland

[task 2017-05-27T04:44:15.247993Z] 04:44:15     INFO -  TEST-START | test_navigation.py TestPageLoadStrategy.test_eager
[task 2017-05-27T04:44:48.297485Z] 04:44:48  WARNING -  TEST-UNEXPECTED-FAIL | test_navigation.py TestPageLoadStrategy.test_eager | AssertionError: 'interactive' != u'complete'
[task 2017-05-27T04:44:48.298563Z] 04:44:48     INFO -  Traceback (most recent call last):
[task 2017-05-27T04:44:48.298638Z] 04:44:48     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_harness/marionette_test/testcases.py", line 166, in run
[task 2017-05-27T04:44:48.299675Z] 04:44:48     INFO -      testMethod()
[task 2017-05-27T04:44:48.299739Z] 04:44:48     INFO -    File "/home/worker/workspace/build/tests/marionette/tests/testing/marionette/harness/marionette_harness/tests/unit/test_navigation.py", line 634, in test_eager
[task 2017-05-27T04:44:48.299780Z] 04:44:48     INFO -      self.assertEqual("interactive", self.ready_state)
[task 2017-05-27T04:44:48.299827Z] 04:44:48     INFO -  TEST-INFO took 33049ms
1495875484401	Marionette	TRACE	1054 -> [0,2,"get",{"url":"http://127.0.0.1:48970/slow_resource.html"}]
1495875484408	Marionette	DEBUG	Received DOM event "beforeunload" for "http://127.0.0.1:48970/windowHandles.html"
1495875484429	Marionette	DEBUG	Received DOM event "pagehide" for "http://127.0.0.1:48970/windowHandles.html"
1495875484430	Marionette	DEBUG	Received DOM event "unload" for "http://127.0.0.1:48970/windowHandles.html"
1495875484509	Marionette	DEBUG	Received DOM event "DOMContentLoaded" for "http://127.0.0.1:48970/slow_resource.html"
1495875484517	Marionette	TRACE	1054 <- [1,2,null,{}]
1495875489855	Marionette	TRACE	1054 -> [0,3,"executeScript",{"scriptTimeout":null,"newSandbox":true,"args":[],"filename":"test_navigation.py","script":"return window.document.readyState;","sandbox":null,"line":101}]
1495875489905	Marionette	TRACE	1054 <- [1,3,null,{"value":"complete"}]

There is a strange 5.3s gap between receiving the result for the `get` command, and actually submitting the `executeScript` commands. 

Nothing in the test actually shows that:

> self.marionette.navigate(self.test_page_slow_resource)
> self.assertEqual("interactive", self.ready_state)

Andreas, I assume that this happens because we log all the details of the current command in the content process? If that is really the case, why is that necessary and cannot be done in the parent process? I see this as a huge issue in tracing output.
Flags: needinfo?(ato)
(In reply to Henrik Skupin (:whimboo) from comment #1)

> There is a strange 5.3s gap between receiving the result for the `get`
> command, and actually submitting the `executeScript` commands.
>
> Nothing in the test actually shows that:
>
> > self.marionette.navigate(self.test_page_slow_resource)
> > self.assertEqual("interactive", self.ready_state)

It’s hard to tell why it takes over five seconds before the client
sends the executeScript command.  Open questions are if the Python
client is doing something in the mean time, if it’s trying to send the
next message but being blocked by the socket, if the Firefox process is
busy with something else, or if the logging output is being delayed.

You could try adding some print statements to the client, especially
transport.py, and dump statements to server.js.

> Andreas, I assume that this happens because we log all the details
> of the current command in the content process? If that is really
> the case, why is that necessary and cannot be done in the parent
> process? I see this as a huge issue in tracing output.

If you mean the TRACE messages, they are logged from
the main process in testing/marionette/server.js:629
(https://searchfox.org/mozilla-central/rev/31eec6f59eb303adf3318058cb30b
d6a883115a5/testing/marionette/server.js#629-633).
Flags: needinfo?(ato)
(In reply to Andreas Tolfsen ‹:ato› from comment #2)
> You could try adding some print statements to the client, especially
> transport.py, and dump statements to server.js.

Good idea. When we see a spike in those failures I will certainly do that. For now it seems that the failure has gone.

> > Andreas, I assume that this happens because we log all the details
> > of the current command in the content process? If that is really
> > the case, why is that necessary and cannot be done in the parent
> > process? I see this as a huge issue in tracing output.
> 
> If you mean the TRACE messages, they are logged from
> the main process in testing/marionette/server.js:629
> (https://searchfox.org/mozilla-central/rev/31eec6f59eb303adf3318058cb30b
> d6a883115a5/testing/marionette/server.js#629-633).

Ah! In that case it might be indeed an issue with the socket, that it is busy and data transfer is getting delayed. 

Thanks.
On bug 1378191 I figured out that we reset the capabilities including the page load strategy for each reload of the listener script. Maybe this is somehow involved here.
this is primarily on android debug.

:automatedtester, this is another intermittent that has increased in frequency and is trending towards a high frequency bug- worth looking into if it is an easy fix for someone on the Marionette team.
Flags: needinfo?(dburns)
There is extra code going in that has logging especially around a race condition. We need to review after a couple pushes after bug 1368434
(In reply to David Burns :automatedtester from comment #13)
> There is extra code going in that has logging especially around a race
> condition. We need to review after a couple pushes after bug 1368434

That shouldn't make a difference because it's only relevant for e10s. On Android we don't have support for that.

This is still a timing issue:

07-16 10:29:51.782 I/Gecko   (  724): 1500226191783	Marionette	DEBUG	Received DOM event "DOMContentLoaded" for "http://172.17.0.5:43918/slow_resource.html"
07-16 10:29:54.782 I/Gecko   (  724): 1500226194787	Marionette	TRACE	97 <- [1,2,null,{}]
07-16 10:29:55.903 I/Gecko   (  724): --DOMWINDOW == 23 (0x57b81000) [pid = 724] [serial = 282] [outer = 0x0] [url = http://172.17.0.5:43918/test.html]
07-16 10:29:55.913 I/Gecko   (  724): --DOMWINDOW == 22 (0x5494ec00) [pid = 724] [serial = 287] [outer = 0x0] [url = about:blank]
07-16 10:29:56.152 I/Gecko   (  724): 1500226196158	Marionette	TRACE	97 -> [0,3,"executeScript",{"scriptTimeout":null,"newSandbox":true,"args":[],"filename":"test_navigation.py","script":"return window.document.readyState;","sandbox":null,"line":107}]
07-16 10:29:56.472 I/Gecko   (  724): 1500226196476	Marionette	TRACE	97 <- [1,3,null,{"value":"complete"}]

The duration of the execute_script call is simply too long compared to when the document is fully loaded. I would propose that we increase the delay of loading the image in slow_resource.html a bit more. I believe 8s should be fine.

Otherwise we could only skip this test on Android.
Oh, we call execute_script via content scope. Maybe using chrome, and querying the content browser for it might also speed it up significantly. This would avoid the internal message manager usage.
Attachment #8889370 - Flags: review?(dburns) → review?(james)
Comment on attachment 8889370 [details]
Bug 1368227 - Disabling Marionette navigation eager test on Android.

https://reviewboard.mozilla.org/r/160432/#review165740
Attachment #8889370 - Flags: review?(james) → review+
Pushed by dburns@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/318343417034
Disabling Marionette navigation eager test on Android. r=jgraham
Flags: needinfo?(dburns)
Whiteboard: [stockwell disabled]
https://hg.mozilla.org/mozilla-central/rev/318343417034
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
The remaining underlying problem here is a delayed execution of commands. I filed bug 1386549 for tracking purposes.
Status: REOPENED → NEW
Whiteboard: [stockwell disabled] → [stockwell disabled][android disabled]
Keywords: leave-open
Target Milestone: mozilla56 → ---
Priority: -- → P3

Lets try to unskip once Mn jobs run with the faster Android 7.0 x86 emulator (bug 1500509).

OS: Unspecified → Android

We moved away from the ARM platform and this failure no longer occurs on packet.net with Android 7.0 x86_64.

Status: NEW → RESOLVED
Closed: 2 years ago4 months ago
Keywords: test-disabled
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.