Open Bug 1527963 Opened 10 months ago Updated 4 months ago

Intermittent <random marionette test> | IOError: Port 127.0.0.1:2828 is unavailable ([Errno 98] Address already in use)

Categories

(Testing :: Mozbase, defect, P5)

Version 3
x86_64
Android
defect

Tracking

(Not tracked)

REOPENED

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, regression)

#[markdown(off)]
Filed by: gbrown [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=228438591&repo=try

https://queue.taskcluster.net/v1/task/XJ2MxIRuRxOvXGBxbe2lOA/runs/0/artifacts/public/logs/live_backing.log

Intermittent failures seen on Android x86_64. Affects various marionette tests.

This actually looks like that adb forward is going down and because of that the connection request to 127.0.0.1:2828 ends-up on the host, where no Firefox/Marionette is running.

Sadly there are no log artifacts available which could prove that. Geoff, I think that is something which needs your or Bob's attention.

There are logcat-emulator-5554.log files available for each of the failing tests.

I don't see anything indicating an error there around the times of the error. The error is also not persistent in that subsequent tests appear to pass if I understand correctly. Is there anything that calls adb forward again to recover?

It would be helpful I think if start_binary/is_port_available didn't obfuscate the socket.error by catching the exception in is_port_available and raising a different exception in start_binary. is_port_available is only called by start_binary, can we just not catch the exception in is_port_available and remove the raise in start_binary and get the real error in the log?

OS: Unspecified → Android
Hardware: Unspecified → x86_64

Geoff, would you mind triggering a new try build? Hopefully we will now see what the real error message is.

Flags: needinfo?(gbrown)

For reference here the traceback:

[task 2019-02-21T16:08:41.345Z] 16:08:41 INFO - Traceback (most recent call last):
[task 2019-02-21T16:08:41.345Z] 16:08:41 INFO - File "/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_harness/marionette_test/testcases.py", line 140, in run
[task 2019-02-21T16:08:41.345Z] 16:08:41 INFO - self.setUp()
[task 2019-02-21T16:08:41.345Z] 16:08:41 INFO - File "/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_harness/marionette_test/testcases.py", line 331, in setUp
[task 2019-02-21T16:08:41.345Z] 16:08:41 INFO - super(MarionetteTestCase, self).setUp()
[task 2019-02-21T16:08:41.346Z] 16:08:41 INFO - File "/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_harness/marionette_test/testcases.py", line 254, in setUp
[task 2019-02-21T16:08:41.346Z] 16:08:41 INFO - self.marionette.start_session()
[task 2019-02-21T16:08:41.347Z] 16:08:41 INFO - File "/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_driver/decorators.py", line 26, in _
[task 2019-02-21T16:08:41.347Z] 16:08:41 INFO - return func(*args, **kwargs)
[task 2019-02-21T16:08:41.347Z] 16:08:41 INFO - File "/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1274, in start_session
[task 2019-02-21T16:08:41.347Z] 16:08:41 INFO - self.start_binary(timeout)
[task 2019-02-21T16:08:41.347Z] 16:08:41 INFO - File "/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_driver/marionette.py", line 649, in start_binary
[task 2019-02-21T16:08:41.347Z] 16:08:41 INFO - reraise(IOError, msg, tb)
[task 2019-02-21T16:08:41.347Z] 16:08:41 INFO - File "/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_driver/marionette.py", line 645, in start_binary
[task 2019-02-21T16:08:41.347Z] 16:08:41 INFO - self.check_port_available(self.port, host=self.host)
[task 2019-02-21T16:08:41.347Z] 16:08:41 INFO - File "/builds/worker/workspace/build/venv/lib/python2.7/site-packages/marionette_driver/marionette.py", line 693, in check_port_available
[task 2019-02-21T16:08:41.347Z] 16:08:41 INFO - s.bind((host, port))
[task 2019-02-21T16:08:41.347Z] 16:08:41 INFO - File "/usr/lib/python2.7/socket.py", line 228, in meth
[task 2019-02-21T16:08:41.347Z] 16:08:41 INFO - return getattr(self._sock,name)(*args)
[task 2019-02-21T16:08:41.347Z] 16:08:41 INFO - TEST-INFO took 104ms

Strange. So for Fennec we don't have any active test which shutdown or restarts Fennec. As such I wouldn't expect that start_session calls into start_binary again:

https://dxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette_driver/marionette.py#1268

I can only assume that self.instance.runner.returncode (mozrunner) gives a wrong returncode, or Fennec quit due to some reason. Note that AFAIK we don't analyze crash reports for Fennec yet.

Maybe lets add some more debug logging in mozrunner and Marionette client to see the returncode.

Geoff, would you have the time to do it?

Summary: Intermittent <random marionette test> | IOError: Port 127.0.0.1:2828 is unavailable. → Intermittent <random marionette test> | IOError: Port 127.0.0.1:2828 is unavailable ([Errno 98] Address already in use)

Sure, I'll take a look.

I looked at the logcat and didn't see anything that indicated a crash of fennec or anything.

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=04e8e4a0bd3d9d5d985d07f72208e114a4fe9b62

I think pidof intermittently returns no output even though fennec is running fine throughout.

Wait, now I see it. The exit code 0 is a normal shutdown, so no crash involved. So sorry for not checking the former test method in detail at the first place. But here it is:

https://searchfox.org/mozilla-central/rev/b36e97fc776635655e84f2048ff59f38fa8a4626/testing/marionette/harness/marionette_harness/tests/unit/test_timeouts.py#69-88

There we are calling indeed marionette.quit(), followed by a start_session(). But that should only run when Marionette manages the instance. AFAIK this is not the case for Android. Also there is a skip_if_mobile which means that something incorrectly assumes that we are not on mobile. This decorator is defined here:

https://searchfox.org/mozilla-central/rev/b36e97fc776635655e84f2048ff59f38fa8a4626/testing/marionette/harness/marionette_harness/marionette_test/decorators.py#142

Now I wonder why this failure is intermittent and not permanent! As it looks like we may have a wrongly set browserName capability.

Given that we do not skip this method Fennec quits, and we fail to start Fennec again because the port was still used. Interestingly the process is gone and as such we should be able to claim that port.

For now we should just figure out why we don't skip the test.

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #10)

That test is normally shown as skipped, but start_session is called and the returncode checked anyway:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=229819592&repo=try&lineNumber=3387-3390

[task 2019-02-22T04:08:26.855Z] 04:08:26 INFO - TEST-START | testing/marionette/harness/marionette_harness/tests/unit/test_timeouts.py TestTimeouts.test_reset_timeout
[task 2019-02-22T04:08:26.855Z] 04:08:26 INFO - ZZZ start_session called
[task 2019-02-22T04:08:26.957Z] 04:08:26 INFO - ZZZ mozrunner returncode finds org.mozilla.fennec_aurora exists
[task 2019-02-22T04:08:26.982Z] 04:08:26 INFO - TEST-SKIP | testing/marionette/harness/marionette_harness/tests/unit/test_timeouts.py TestTimeouts.test_reset_timeout | took 127ms

We hit this condition for a variety of tests, not all of which are skipped:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=229819592&repo=try&lineNumber=3452

testing/marionette/harness/marionette_harness/tests/unit/test_switch_frame.py TestSwitchFrame.test_switch_to_parent_frame | IOError: Port 127.0.0.1:2828 is unavailable ([Errno 98] Address already in use)

So, it would be nice to skip the start_session, etc for skipped tests, but that won't solve this problem.

Also, in all these cases, it is clear that one instance of Fennec is running throughout: The logcats show the same fennec pid at the start and end of the logcat.

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=229811363&repo=try&lineNumber=3389

https://taskcluster-artifacts.net/Uhwbj4e8TmqjIQoXl6bOxw/0/public/test_info//logcat-emulator-5554.log

02-22 03:09:19.470 1322 1334 I ActivityManager: Start proc 2394:org.mozilla.fennec_aurora/u0a62 for activity org.mozilla.fennec_aurora/org.mozilla.gecko.BrowserApp
02-22 03:09:19.560 2394 2394 I GeckoApplication: zerdatime 75273 - application start
...
02-22 03:17:08.600 2394 2394 D GeckoBrowserApp: BrowserApp.onTabChanged: 7: THUMBNAIL
02-22 03:17:10.430 2394 2418 I Gecko : 1550805430430 Marionette DEBUG Closed connection 751

and yet:

[task 2019-02-22T03:13:53.749Z] 03:13:53 INFO - TEST-START | testing/marionette/harness/marionette_harness/tests/unit/test_timeouts.py TestTimeouts.test_reset_timeout
[task 2019-02-22T03:13:53.749Z] 03:13:53 INFO - ZZZ start_session called
[task 2019-02-22T03:13:53.853Z] 03:13:53 INFO - ZZZ _pidof pidof found no pids matching org.mozilla.fennec_aurora!!
[task 2019-02-22T03:13:53.854Z] 03:13:53 INFO - ZZZ mozrunner returncode finds org.mozilla.fennec_aurora does not exist!
[task 2019-02-22T03:13:53.855Z] 03:13:53 INFO - ZZZ start_session calling start_binary because returncode is 0

so I am pretty sure the main problem is that mozdevice relies on pidof, but pidof is flaky.

Depends on: 1529960
Assignee: nobody → gbrown
Status: NEW → RESOLVED
Closed: 10 months ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1529960
Component: Marionette → Mozbase

One-off failures like comment 15 might be a temporary infra issue or have other causes unrelated to the pidof issue which I was involved in, bug 1529960.

Assignee: gbrown → nobody
You need to log in before you can comment on or make changes to this bug.