Closed Bug 1334149 Opened 7 years ago Closed 7 years ago

Lots of intermittent failures for navigate() on Android which results in a hang until IOError (Process gets killed)

Categories

(Remote Protocol :: Marionette, defect)

Version 3
defect
Not set
normal

Tracking

(firefox52 fixed, firefox53 fixed, firefox54 fixed)

RESOLVED FIXED
mozilla54
Tracking Status
firefox52 --- fixed
firefox53 --- fixed
firefox54 --- fixed

People

(Reporter: whimboo, Assigned: whimboo)

References

Details

Attachments

(2 files)

This is most likely the underlying issue of all remaining highly occurring Fennec intermittent failures we currently are seeing on integration branches. I filed a couple of bugs already which I will mark dependent on this one if appropriate.
Blocks: 1297394
Blocks: 1334142
Blocks: 1334141
Blocks: 1334034
Blocks: 1334033
Blocks: 1334032
Blocks: 1334031
So this is actually the cause for any of the process killed failures we have on integration branches those days. All of them fail in get(), means the proxy never terminates a get request and as such we run into the timeout.

We should keep in mind that this cannot be related to remoteness because for Fennec e10s is turned off and we also run the tests with --disable-e10s.

Marionette unit test for desktop which run with --disable-e10s are working fine. So this is only Fennec related.

I have a patch locally which could add a lot of logging to the get() method and how events are handled. I will push this to try with the hope that we could see some dump() output in the logs - but I doubt so.

Maybe its the same issue as I have mentioned on bug 1242595 comment 1, which is about the slow loading test page and a hang because we never fulfill the promise and the above hang in the proxy happens.
Andreas, could bug 1242595 help us here?
Flags: needinfo?(ato)
The try build with debug dump() statements included actually worked! Here the full gecko.log:

https://public-artifacts.taskcluster.net/PHNSNoQRQfav6lxbPl1qcw/0/public/test_info//gecko.log

Search for "test_should_be_able_to_switch_to_a_frame_by_its_index" and follow the output. We start a get() request for frameset.html for which we expect a load event. Then document loaded fired for the requested URL. Readystate by that time is "interactive". Then we see several further page loads for the contained frames. Later at 10:52:42 we get a "complete" ready state, and we remove the listeners. Finally we send the OK back to the proxy promise. But then it stops. No further communication is happening then.

To make it easier to reproduce for us I will attach a minimized testcase for the slow loading page, which demonstrates the exact same issue for Firefox at least in e10s mode. I haven't tested it with e10s disabled yet.
Looks like the problem with the slow loading page was related to a shorter socket timeout than the page load timeout. So I cannot reproduce it anymore since the patch on bug 1322277 has been landed.

I may have to add more debugging output and do some more try builds.
Attached file gecko.log (test fail)
Watching the Android tests for a day now, the failure which is most often occurring is in test_switch_frame.py TestSwitchFrame.test_should_be_able_to_switch_to_a_frame_by_its_index. So I will attach a log here with all the debug info I setup included.

https://public-artifacts.taskcluster.net/a9qYNwQJTcCjrrM8pfP6Gg/0/public/test_info//gecko.log

When checking the log I noticed that we seem to run the affected test twice, and somewhat in an arbitrary order:

01-27 04:35:59.158 I/Gecko   (  720): MARIONETTE LOG: INFO: TEST-START: test_switch_frame.py:test_should_be_able_to_switch_to_a_frame_by_its_index
[..]
01-27 04:37:23.598 I/Gecko   ( 1819): [1819] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file /home/worker/workspace/build/src/toolkit/crashreporter/nsExceptionHandler.cpp, line 1866
--------- beginning of /dev/log/radio
--------- beginning of /dev/log/main
01-27 04:33:47.958 I/Gecko   (  720): ** send broadcast delayed async message
[..]
01-27 04:33:49.878 I/Gecko   (  720): MARIONETTE LOG: INFO: TEST-START: test_execute_async_script.py:test_sandbox_refresh_arguments
[..]
01-27 04:33:50.728 I/Gecko   (  720): MARIONETTE LOG: INFO: TEST-END: test_execute_async_script.py:test_sandbox_refresh_arguments
[..]
lots of other tests
[..]
01-27 04:35:59.158 I/Gecko   (  720): MARIONETTE LOG: INFO: TEST-START: test_switch_frame.py:test_should_be_able_to_switch_to_a_frame_by_its_index
[..]
01-27 04:38:52.128 I/Gecko   ( 1819): MARIONETTE LOG: INFO: TEST-END: /home/worker/workspace/build/tests/marionette/tests/testing/marionette/harness/marionette_harness/tests/unit/test_switch_frame.py:test_should_be_able_to_switch_to_a_frame_by_its_index
01-27 04:38:52.128 I/Gecko   ( 1819): MARIONETTE LOG: INFO: TEST-END: /home/worker/workspace/build/tests/marionette/tests/testing/marionette/harness/marionette_harness/tests/unit/test_switch_frame.py:test_should_be_able_to_switch_to_a_frame_by_its_index
[..]

As you can see we have the TEST-START entry contained twice in that log. Both times are identical, same also applies for the TEST-END messages. Interesting is that there is a turn back in time after the crash reporter lines at 04:37:23.598 to 04:33:47.958. This is a difference of about 3:45min.

Not sure if that actually is a crash or not, but why do we have such a gap in our log? Btw the next time we reach 04:37:23.598 all is fine. So something is totally busted with logging here, which also makes it hard to debug the hang at all.
Geoff, maybe you have an idea? I don't really know anything about logging on Android, and what the above could mean.
Attachment #8832241 - Attachment description: gecko.log → gecko.log (test fail)
The get request gets started at 04:35:59.548, which expects a load event. The last document loaded for the frameset.html is at 04:36:32.039. Then the crash reporter lines are showing up, followed by:

> 01-27 04:37:34.498 I/Gecko   ( 1819): AndroidBridge::Init
> 01-27 04:37:34.498 I/Gecko   ( 1819): AndroidBridge::Init

When are we usually running this code?

https://dxr.mozilla.org/mozilla-central/rev/9c06e744b1befb3a2e2fdac7414ce18220774a1d/widget/android/AndroidBridge.cpp#160

Our internal timeout for page load is set to 60s so a forced-kill would happen at 04:36:59. Here it seems we have it at 04:37:23.378. So close to a half minute later. Maybe it's a delay because we run the tests in the emulator?

Btw a working page load looks like that:

> 01-27 04:42:32.895 I/Gecko   (  718): ************ requested url: http://172.17.0.5:43244/frameset.html
[..]
> 01-27 04:42:59.655 I/Gecko   (  718): ****** ready state: complete
> 01-27 04:42:59.675 I/Gecko   (  718): ****** complete for {951b71fc-5c3a-480f-bc0a-4d210b7f3500}
> 01-27 04:42:59.685 I/Gecko   (  718): ****** inside final callback to remove listeners
> 01-27 04:42:59.705 I/Gecko   (  718): ****** end final callback to remove listeners
> 01-27 04:42:59.735 I/Gecko   (  718): ****** send ok

So it takes about 27s to load the frameset.html as served by the host outside of the emulator. 

Maybe what we see here are networking issue with the emulator? Should we increase the page load timeout for Fennec tests to a higher value than 60s? Maybe it might be worth a try.
gecko.log seems to be a marionette log created by filtering the logcat for messages logged by "Gecko". I see https://dxr.mozilla.org/mozilla-central/rev/9c06e744b1befb3a2e2fdac7414ce18220774a1d/testing/marionette/client/marionette_driver/geckoinstance.py#329, but I'm not familiar with the code.

I'd be more comfortable looking at the full logcat; it's much larger and has lots of content generated by other Android components, but there's (almost) no special processing and it's usually pretty reliable. You can find the full logcat for any test job run with android_emulator_unittest (marionette included) in the treeherder "Job status" as "artifact: logcat-emulator-5554.log".
Thank you Geoff! I will keep this in mind for the current try push. As we talked about those failures yesterday I'm trying a new approach now which might hopefully workaround those issues. It means we want to bump up the current socket-timeout value from the default 65s to 360s. Also the page load timeout will be bumped to 120s. Lets see if that works:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=79e53d3e1b508eb15ff678067e23114b8b30694c

Ultimately we indeed want to get rid of this socket timeout, but there are other dependent issues first which would have to be fixed, like bug 1130220.
Flags: needinfo?(ato)
Good news! The try build doesn't show any existence of this IOError again! Even for lots of retriggered jobs. \o/

So my proposal would be:

1. Backout the patch on bug 1322277 on all landed branches, and keep the default page load timeout as specified by the WebDriver spec. This gives our tests enough time to actually get the requested page loaded. And this especially for the Fennec builds run in an awful slow emulator.

2. Revert the default socket timeout back to 360s which is larger than the default page load timeout, and also gives Marionette server an extra 60s window to send a reply with the timeout exception to the client. We should add a good explanation to this change, so we don't touch those values again, until we can remove the socket timeout setting completely.

The follow-up patch I would like to do on bug 1322277, so this bug would have to wait for it.

Andreas, are you with me on this?
Depends on: 1322277
Flags: needinfo?(ato)
Make it so.
Flags: needinfo?(ato)
...and document very prominently that the timeouts are so long because emulators, so that we don't shorten them yet again some day after we've forgotten all about this. ;)
Bug 1322277 has been backed out across all branches.
Assignee: nobody → hskupin
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Not fixed yet. We are still waiting for the new patch on bug 1322277 being landed.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This is fixed for Firefox 54 now. Since the landing of the patch on bug 1322277 no more test failures as reported here occur!
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Product: Testing → Remote Protocol
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: