1334149 - Lots of intermittent failures for navigate() on Android which results in a hang until IOError (Process gets killed)

Assignee

Description

•

7 years ago

This is most likely the underlying issue of all remaining highly occurring Fennec intermittent failures we currently are seeing on integration branches. I filed a couple of bugs already which I will mark dependent on this one if appropriate.

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Updated

•

7 years ago

Blocks: 1297394

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Updated

•

7 years ago

Blocks: 1334142

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Updated

•

7 years ago

Blocks: 1334141

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Updated

•

7 years ago

Blocks: 1334034

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Updated

•

7 years ago

Blocks: 1334033

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Updated

•

7 years ago

Blocks: 1334032

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Updated

•

7 years ago

Blocks: 1334031

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 1

•

7 years ago

So this is actually the cause for any of the process killed failures we have on integration branches those days. All of them fail in get(), means the proxy never terminates a get request and as such we run into the timeout.

We should keep in mind that this cannot be related to remoteness because for Fennec e10s is turned off and we also run the tests with --disable-e10s.

Marionette unit test for desktop which run with --disable-e10s are working fine. So this is only Fennec related.

I have a patch locally which could add a lot of logging to the get() method and how events are handled. I will push this to try with the hope that we could see some dump() output in the logs - but I doubt so.

Maybe its the same issue as I have mentioned on bug 1242595 comment 1, which is about the slow loading test page and a hang because we never fulfill the promise and the above hang in the proxy happens.

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 2

•

7 years ago

Andreas, could bug 1242595 help us here?

Flags: needinfo?(ato)

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 3

•

7 years ago

The try build with debug dump() statements included actually worked! Here the full gecko.log:

https://public-artifacts.taskcluster.net/PHNSNoQRQfav6lxbPl1qcw/0/public/test_info//gecko.log

Search for "test_should_be_able_to_switch_to_a_frame_by_its_index" and follow the output. We start a get() request for frameset.html for which we expect a load event. Then document loaded fired for the requested URL. Readystate by that time is "interactive". Then we see several further page loads for the contained frames. Later at 10:52:42 we get a "complete" ready state, and we remove the listeners. Finally we send the OK back to the proxy promise. But then it stops. No further communication is happening then.

To make it easier to reproduce for us I will attach a minimized testcase for the slow loading page, which demonstrates the exact same issue for Firefox at least in e10s mode. I haven't tested it with e10s disabled yet.

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 4

•

7 years ago

Looks like the problem with the slow loading page was related to a shorter socket timeout than the page load timeout. So I cannot reproduce it anymore since the patch on bug 1322277 has been landed.

I may have to add more debugging output and do some more try builds.

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 5

•

7 years ago

Attached file gecko.log (test fail) — Details

Watching the Android tests for a day now, the failure which is most often occurring is in test_switch_frame.py TestSwitchFrame.test_should_be_able_to_switch_to_a_frame_by_its_index. So I will attach a log here with all the debug info I setup included.

https://public-artifacts.taskcluster.net/a9qYNwQJTcCjrrM8pfP6Gg/0/public/test_info//gecko.log

When checking the log I noticed that we seem to run the affected test twice, and somewhat in an arbitrary order:

01-27 04:35:59.158 I/Gecko   (  720): MARIONETTE LOG: INFO: TEST-START: test_switch_frame.py:test_should_be_able_to_switch_to_a_frame_by_its_index
[..]
01-27 04:37:23.598 I/Gecko   ( 1819): [1819] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file /home/worker/workspace/build/src/toolkit/crashreporter/nsExceptionHandler.cpp, line 1866
--------- beginning of /dev/log/radio
--------- beginning of /dev/log/main
01-27 04:33:47.958 I/Gecko   (  720): ** send broadcast delayed async message
[..]
01-27 04:33:49.878 I/Gecko   (  720): MARIONETTE LOG: INFO: TEST-START: test_execute_async_script.py:test_sandbox_refresh_arguments
[..]
01-27 04:33:50.728 I/Gecko   (  720): MARIONETTE LOG: INFO: TEST-END: test_execute_async_script.py:test_sandbox_refresh_arguments
[..]
lots of other tests
[..]
01-27 04:35:59.158 I/Gecko   (  720): MARIONETTE LOG: INFO: TEST-START: test_switch_frame.py:test_should_be_able_to_switch_to_a_frame_by_its_index
[..]
01-27 04:38:52.128 I/Gecko   ( 1819): MARIONETTE LOG: INFO: TEST-END: /home/worker/workspace/build/tests/marionette/tests/testing/marionette/harness/marionette_harness/tests/unit/test_switch_frame.py:test_should_be_able_to_switch_to_a_frame_by_its_index
01-27 04:38:52.128 I/Gecko   ( 1819): MARIONETTE LOG: INFO: TEST-END: /home/worker/workspace/build/tests/marionette/tests/testing/marionette/harness/marionette_harness/tests/unit/test_switch_frame.py:test_should_be_able_to_switch_to_a_frame_by_its_index
[..]

As you can see we have the TEST-START entry contained twice in that log. Both times are identical, same also applies for the TEST-END messages. Interesting is that there is a turn back in time after the crash reporter lines at 04:37:23.598 to 04:33:47.958. This is a difference of about 3:45min.

Not sure if that actually is a crash or not, but why do we have such a gap in our log? Btw the next time we reach 04:37:23.598 all is fine. So something is totally busted with logging here, which also makes it hard to debug the hang at all.

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 6

•

7 years ago

Geoff, maybe you have an idea? I don't really know anything about logging on Android, and what the above could mean.

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 7

•

7 years ago

Attached file gecko.log (test pass) — Details

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Updated

•

7 years ago

Attachment #8832241 - Attachment description: gecko.log → gecko.log (test fail)

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 8

•

7 years ago

The get request gets started at 04:35:59.548, which expects a load event. The last document loaded for the frameset.html is at 04:36:32.039. Then the crash reporter lines are showing up, followed by:

> 01-27 04:37:34.498 I/Gecko   ( 1819): AndroidBridge::Init
> 01-27 04:37:34.498 I/Gecko   ( 1819): AndroidBridge::Init

When are we usually running this code?

https://dxr.mozilla.org/mozilla-central/rev/9c06e744b1befb3a2e2fdac7414ce18220774a1d/widget/android/AndroidBridge.cpp#160

Our internal timeout for page load is set to 60s so a forced-kill would happen at 04:36:59. Here it seems we have it at 04:37:23.378. So close to a half minute later. Maybe it's a delay because we run the tests in the emulator?

Btw a working page load looks like that:

> 01-27 04:42:32.895 I/Gecko   (  718): ************ requested url: http://172.17.0.5:43244/frameset.html
[..]
> 01-27 04:42:59.655 I/Gecko   (  718): ****** ready state: complete
> 01-27 04:42:59.675 I/Gecko   (  718): ****** complete for {951b71fc-5c3a-480f-bc0a-4d210b7f3500}
> 01-27 04:42:59.685 I/Gecko   (  718): ****** inside final callback to remove listeners
> 01-27 04:42:59.705 I/Gecko   (  718): ****** end final callback to remove listeners
> 01-27 04:42:59.735 I/Gecko   (  718): ****** send ok

So it takes about 27s to load the frameset.html as served by the host outside of the emulator. 

Maybe what we see here are networking issue with the emulator? Should we increase the page load timeout for Fennec tests to a higher value than 60s? Maybe it might be worth a try.

Geoff Brown [:gbrown]

Comment 9

•

7 years ago

gecko.log seems to be a marionette log created by filtering the logcat for messages logged by "Gecko". I see https://dxr.mozilla.org/mozilla-central/rev/9c06e744b1befb3a2e2fdac7414ce18220774a1d/testing/marionette/client/marionette_driver/geckoinstance.py#329, but I'm not familiar with the code.

I'd be more comfortable looking at the full logcat; it's much larger and has lots of content generated by other Android components, but there's (almost) no special processing and it's usually pretty reliable. You can find the full logcat for any test job run with android_emulator_unittest (marionette included) in the treeherder "Job status" as "artifact: logcat-emulator-5554.log".

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 10

•

7 years ago

Thank you Geoff! I will keep this in mind for the current try push. As we talked about those failures yesterday I'm trying a new approach now which might hopefully workaround those issues. It means we want to bump up the current socket-timeout value from the default 65s to 360s. Also the page load timeout will be bumped to 120s. Lets see if that works:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=79e53d3e1b508eb15ff678067e23114b8b30694c

Ultimately we indeed want to get rid of this socket timeout, but there are other dependent issues first which would have to be fixed, like bug 1130220.

Flags: needinfo?(ato)

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 11

•

7 years ago

Good news! The try build doesn't show any existence of this IOError again! Even for lots of retriggered jobs. \o/

So my proposal would be:

1. Backout the patch on bug 1322277 on all landed branches, and keep the default page load timeout as specified by the WebDriver spec. This gives our tests enough time to actually get the requested page loaded. And this especially for the Fennec builds run in an awful slow emulator.

2. Revert the default socket timeout back to 360s which is larger than the default page load timeout, and also gives Marionette server an extra 60s window to send a reply with the timeout exception to the client. We should add a good explanation to this change, so we don't touch those values again, until we can remove the socket timeout setting completely.

The follow-up patch I would like to do on bug 1322277, so this bug would have to wait for it.

Andreas, are you with me on this?

Depends on: 1322277

Flags: needinfo?(ato)

Andreas Tolfsen ❲:ato❳

Comment 12

•

7 years ago

Make it so.

Flags: needinfo?(ato)

Maja Frydrychowicz :impossibus (was :maja_zf) (needinfo me)

Comment 13

•

7 years ago

...and document very prominently that the timeouts are so long because emulators, so that we don't shorten them yet again some day after we've forgotten all about this. ;)

Ryan VanderMeulen [:RyanVM]

Comment 14

•

7 years ago

Bug 1322277 has been backed out across all branches.

Assignee: nobody → hskupin

Status: NEW → RESOLVED

Closed: 7 years ago

status-firefox52: --- → fixed

status-firefox53: --- → fixed

status-firefox54: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla54

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 15

•

7 years ago

Not fixed yet. We are still waiting for the new patch on bug 1322277 being landed.

Status: RESOLVED → REOPENED

status-firefox52: fixed → affected

status-firefox53: fixed → affected

status-firefox54: fixed → affected

Resolution: FIXED → ---

Henrik Skupin [:whimboo][⌚️UTC+1]

Assignee

Comment 16

•

7 years ago

This is fixed for Firefox 54 now. Since the landing of the patch on bug 1322277 no more test failures as reported here occur!

Status: REOPENED → RESOLVED

Closed: 7 years ago → 7 years ago

status-firefox54: affected → fixed

Resolution: --- → FIXED

Ryan VanderMeulen [:RyanVM]

Updated

•

7 years ago

status-firefox52: affected → fixed

status-firefox53: affected → fixed

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

•

7 years ago

Blocks: 1348548

BMO Automation

Updated

•

1 year ago

Product: Testing → Remote Protocol

gecko.log (test fail) 7 years ago Henrik Skupin [:whimboo][⌚️UTC+1] 68.06 KB, text/plain		Details
gecko.log (test pass) 7 years ago Henrik Skupin [:whimboo][⌚️UTC+1] 7.56 KB, text/plain		Details