Closed Bug 1791216 Opened 3 years ago Closed 4 months ago

Intermittent Android WD | unknown error (500): adb error: cannot bind listener: Address already in use

Categories

(Remote Protocol :: Marionette, defect, P5)

All
Android
defect

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: assertion, intermittent-failure)

Filed by: mlaza [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=390685924&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/G8zWCIART_yW2NbwE5tW1A/runs/0/artifacts/public/logs/live_backing.log


[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO - TEST-PASS | /webdriver/tests/new_session/create_firstMatch.py | test_valid[timeouts-None] 
[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO - TEST-UNEXPECTED-FAIL | /webdriver/tests/new_session/create_firstMatch.py | test_valid[timeouts-value11] - AssertionError: unknown error (500): adb error: cannot bind listener: Address already in use
[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO - new_session = <function fixture_new_session.<locals>.new_session at 0x7fd762873950>
[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO - add_browser_capabilities = <function fixture_add_browser_capabilities.<locals>.add_browser_capabilities at 0x7fd762873c80>
[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO - key = 'timeouts', value = {}
[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO - 
[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO -     @pytest.mark.parametrize("key,value", flatten(product(*item) for item in valid_data))
[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO -     def test_valid(new_session, add_browser_capabilities, key, value):
[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO -         response, _ = new_session({"capabilities": {
[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO -             "firstMatch": [add_browser_capabilities({key: value})]}})
[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO - >       assert_success(response)
[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO - 
[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO - _          = None
[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO - add_browser_capabilities = <function fixture_add_browser_capabilities.<locals>.add_browser_capabilities at 0x7fd762873c80>
[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO - key        = 'timeouts'
[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO - new_session = <function fixture_new_session.<locals>.new_session at 0x7fd762873950>
[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO - response   = <Response status=500 error=<UnknownErrorException http_status=500>>
[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO - value      = {}
[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO - 
[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO - tests/web-platform/tests/webdriver/tests/new_session/create_firstMatch.py:16: 
[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO - _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[task 2022-09-16T20:52:28.683Z] 20:52:28     INFO - 
[task 2022-09-16T20:52:28.684Z] 20:52:28     INFO - response = <Response status=500 error=<UnknownErrorException http_status=500>>
[task 2022-09-16T20:52:28.684Z] 20:52:28     INFO - value = None
[task 2022-09-16T20:52:28.684Z] 20:52:28     INFO - 
[task 2022-09-16T20:52:28.684Z] 20:52:28     INFO -     def assert_success(response, value=None):
[task 2022-09-16T20:52:28.684Z] 20:52:28     INFO -         """
[task 2022-09-16T20:52:28.684Z] 20:52:28     INFO -         Verify that the provided webdriver.Response instance described
[task 2022-09-16T20:52:28.684Z] 20:52:28     INFO -         a valid success response as defined by `dfn-send-a-response` and
[task 2022-09-16T20:52:28.684Z] 20:52:28     INFO -         the provided response value.
[task 2022-09-16T20:52:28.684Z] 20:52:28     INFO -     
[task 2022-09-16T20:52:28.684Z] 20:52:28     INFO -         :param response: ``webdriver.Response`` instance.
[task 2022-09-16T20:52:28.684Z] 20:52:28     INFO -         :param value: Expected value of the response body, if any.
[task 2022-09-16T20:52:28.684Z] 20:52:28     INFO -         """
[task 2022-09-16T20:52:28.684Z] 20:52:28     INFO - >       assert response.status == 200, str(response.error)
[task 2022-09-16T20:52:28.684Z] 20:52:28     INFO - E       AssertionError: unknown error (500): adb error: cannot bind listener: Address already in use
[task 2022-09-16T20:52:28.684Z] 20:52:28     INFO - E         
[task 2022-09-16T20:52:28.684Z] 20:52:28     INFO - E       assert 500 == 200
[task 2022-09-16T20:52:28.684Z] 20:52:28     INFO - E         +500
[task 2022-09-16T20:52:28.684Z] 20:52:28     INFO - E         -200
[task 2022-09-16T20:52:28.684Z] 20:52:28     INFO - 
[task 2022-09-16T20:52:28.684Z] 20:52:28     INFO - response   = <Response status=500 error=<UnknownErrorException http_status=500>>
[task 2022-09-16T20:52:28.684Z] 20:52:28     INFO - value      = None
[task 2022-09-16T20:52:28.685Z] 20:52:28     INFO - 
[task 2022-09-16T20:52:28.685Z] 20:52:28     INFO - tests/web-platform/tests/webdriver/tests/support/asserts.py:69: AssertionError
[task 2022-09-16T20:52:28.689Z] 20:52:28     INFO - ...............
[task 2022-09-16T20:52:28.689Z] 20:52:28     INFO - TEST-OK | /webdriver/tests/new_session/create_firstMatch.py | took 330198ms

Hi Henrik! I made this a general bug, but could you please take a look at it? Maybe I was wrong in doing that; if so, please NI me and I'll file a separate bug.

Flags: needinfo?(hskupin)
Summary: Intermittent /webdriver/tests/new_session/create_firstMatch.py | test_valid[timeouts-value11] - AssertionError: unknown error (500): adb error: cannot bind listener: Address already in use → Intermittent Android WD | unknown error (500): adb error: cannot bind listener: Address already in use

It's a low-volume tier 2 failure, and to me it's not clear where it's coming from. Maybe it's a side effect of moving from AWS to GCP as well. Given the low number of failures we won't work on it for now.

Flags: needinfo?(hskupin)
Duplicate of this bug: 1802074
No longer duplicate of this bug: 1802074
Duplicate of this bug: 1802131
Duplicate of this bug: 1803462
Duplicate of this bug: 1805376
Moving bug to Remote Protocol::Marionette component per bug 1815831.
Component: geckodriver → Marionette
Product: Testing → Remote Protocol
Duplicate of this bug: 1824563
Depends on: 1571426
See Also: 1571426

So even with bug 1571426 fixed this is still a problem on Android:

https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=wd&fromchange=7ef5101e80536201ddabd23a5cc9a01bbf1a5226&selectedTaskRun=SvVHfBcgRYKtNB_7EMkRgQ.0

Maybe it's related to the adb forwarded ports. Let's observe the failure rate for now.

Based on the failure it should be an issue with adb forward, but the following log shows that we correctly stop the forward in the previous test:
https://treeherder.mozilla.org/logviewer?job_id=468196226&repo=autoland&lineNumber=64903-64925

INFO - PID 3567 | 1722075139369	geckodriver::android	DEBUG	Marionette port forward (37251 -> 2829) stopped
INFO - PID 3567 | 1722075139369	webdriver::server	DEBUG	<- 200 OK {"value":null}
INFO - STDOUT: tests/web-platform/tests/webdriver/tests/classic/new_session/unhandled_prompt_behavior.py::test_unhandled_prompt_behavior_as_object[beforeUnload-dismiss]
INFO - PID 3567 | 1722075139372	webdriver::server	DEBUG	-> POST /session {"capabilities": {"alwaysMatch": {"moz:firefoxOptions": {"args": ["--profile", "/tmp/tmpc7tddl3o.mozrunner"] ... , "MOZ_DISABLE_NONLOCAL_CONNECTIONS": "1", "RUST_BACKTRACE": "1"}}, "unhandledPromptBehavior": {"beforeUnload": "dismiss"}}}}
INFO - PID 3567 | 1722075139373	mozdevice	TRACE	adb server response was "emulator-5554          device product:sdk_phone_x86_64 model:Android_SDK_built_for_x86_64 device:generic_x86_64 transport_id:2\n"
INFO - PID 3567 | 1722075139373	mozdevice	TRACE	execute_host_command: >> "host:transport:emulator-5554"
INFO - PID 3567 | 1722075139373	mozdevice	TRACE	execute_host_command: << []
INFO - PID 3567 | 1722075139373	mozdevice	TRACE	execute_host_command: >> "shell:id"
INFO - PID 3567 | 1722075139386	mozdevice	TRACE	execute_host_command: << "uid=0(root) gid=0(root) groups=0(root),1004(input),1007(log),1011(adb),1015(sdcard_rw),1028(sdcard_r),3001(net_bt_admin),3002(net_bt),3003(inet),3006(net_bw_stats),3009(readproc) context=u:r:su:s0\n"
INFO - PID 3567 | 1722075139386	mozdevice	TRACE	execute_host_command: >> "host:transport:emulator-5554"
INFO - PID 3567 | 1722075139386	mozdevice	TRACE	execute_host_command: << []
INFO - PID 3567 | 1722075139386	mozdevice	TRACE	execute_host_command: >> "shell:su 0 id"
INFO - PID 3567 | 1722075139399	mozdevice	TRACE	execute_host_command: << "uid=0(root) gid=0(root) groups=0(root),1004(input),1007(log),1011(adb),1015(sdcard_rw),1028(sdcard_r),3001(net_bt_admin),3002(net_bt),3003(inet),3006(net_bw_stats),3009(readproc) context=u:r:su:s0\n"
INFO - PID 3567 | 1722075139399	mozdevice	TRACE	execute_host_command: >> "host:transport:emulator-5554"
INFO - PID 3567 | 1722075139400	mozdevice	TRACE	execute_host_command: << []
INFO - PID 3567 | 1722075139400	mozdevice	TRACE	execute_host_command: >> "shell:su -c id"
INFO - PID 3567 | 1722075139407	mozdevice	TRACE	execute_host_command: << "su: invalid uid/gid '-c'\n"
INFO - PID 3567 | 1722075139407	mozdevice	INFO	Device is rooted
INFO - PID 3567 | 1722075139407	mozdevice	TRACE	execute_host_command: >> "host:transport:emulator-5554"
INFO - PID 3567 | 1722075139407	mozdevice	TRACE	execute_host_command: << []
INFO - PID 3567 | 1722075139407	mozdevice	TRACE	execute_host_command: >> "shell:setenforce permissive"
INFO - PID 3567 | 1722075139415	mozdevice	TRACE	execute_host_command: << ""
INFO - PID 3567 | 1722075139415	webdriver::server	DEBUG	<- 500 Internal Server Error {"value":{"error":"unknown error","message":"adb error: cannot bind listener: Address already in use","stacktrace":""}}

As a first step we could change the log output from "Marionette port forward (34975 -> 2829) started" to "Starting Marionette port forward (34975 -> 2829)" so that we know which ports will actually be used. We could do the same for stopping the adb forward. If it fails we would see it anyway.

From a recent log:

[task 2024-08-07T15:58:20.263Z] 15:58:20     INFO - PID 2228 | 1723046300262	geckodriver::android	DEBUG	Start forwarding Marionette port (42092 -> 2829)
[task 2024-08-07T15:58:20.263Z] 15:58:20     INFO - PID 2228 | 1723046300262	webdriver::server	DEBUG	<- 500 Internal Server Error {"value":{"error":"unknown error","message":"adb error: cannot bind listener: Address already in use","stacktrace":""}}

We are stopping the forward to port 2829 in the previous test, a couple of lines above. So maybe it's related to the local port 42092? Searching for it in the log doesn't reveal any further instance, so I'm not sure what this could be.

It would probably help to run adb forward --list in case of a failure creating the port forward. That way we could see whether the source or target port is still in use.
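A small diagnostic along those lines could be sketched in Python; the helper names here are hypothetical, and only the `adb forward --list` invocation (shown as a comment) is the real command:

```python
def parse_forward_list(output):
    """Parse `adb forward --list` output into (serial, local, remote) tuples.

    Each line looks like: `emulator-5554 tcp:42092 tcp:2829`.
    """
    forwards = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 3:
            forwards.append(tuple(parts))
    return forwards


def port_in_use(output, port):
    """Check whether a TCP port appears as a local or remote forward endpoint."""
    needle = "tcp:%d" % port
    return any(needle in (local, remote)
               for _, local, remote in parse_forward_list(output))


if __name__ == "__main__":
    # On a failure we would capture the real command output, e.g.:
    #   out = subprocess.run(["adb", "forward", "--list"],
    #                        capture_output=True, text=True).stdout
    out = "emulator-5554 tcp:42092 tcp:2829\n"
    print(port_in_use(out, 2829))   # True
    print(port_in_use(out, 12345))  # False
```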

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #114)

It would probably help to run adb forward --list in case of a failure creating the port forward. That way we could see whether the source or target port is still in use.

I checked locally what happens when the source or target port is already in use, and I cannot get the adb forward command to fail in any situation. So it looks like the failure is not related to the availability of either of these ports.
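For reference, the OS-level condition behind that error message can be reproduced with plain sockets; this is a minimal sketch of such a local experiment, with the actual adb invocation left as a comment:

```python
import socket


def occupy_port(port=0):
    """Bind a local TCP port the way a leftover listener would."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", port))
    s.listen(1)
    return s


if __name__ == "__main__":
    holder = occupy_port()            # grab an ephemeral port and hold it
    port = holder.getsockname()[1]
    # A second listener on the same port fails with the OS-level error
    # that adb reports as "cannot bind listener: Address already in use":
    try:
        occupy_port(port)
    except OSError as e:
        print(e)
    # With the port held, the local experiment would now run:
    #   adb forward tcp:<port> tcp:2829
    # and (at least in my local testing) adb still reports success.
    holder.close()
```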

As such it's a bit unclear to me what exactly triggers this issue, or whether the adb command would fail on Linux but not on macOS. Otherwise maybe there is an issue with the emulator the tests run in?

Andrew, do you maybe have an idea?

Flags: needinfo?(aerickson)
OS: Unspecified → Android
Hardware: Unspecified → All

I think that this could be flakiness in the Docker layer.

The docker daemon has to forward 7-8 devices' worth of USB operations to its containers, and I think it can become overwhelmed. We have to restart the docker processes once in a while (we'll see clusters of failures among devices on a specific docker host).

Not sure what the long-term solution is, perhaps less-dense docker hosts with only 4 devices each. This is out of our control as our vendor Bitbar controls that layer, but we can definitely let them know we're having issues.

Flags: needinfo?(aerickson)

(In reply to Andrew Erickson [:aerickson] from comment #117)

Not sure what the long-term solution is, perhaps less-dense docker hosts with only 4 devices each. This is out of our control as our vendor Bitbar controls that layer, but we can definitely let them know we're having issues.

Thank you for the explanation Andrew! I would appreciate it if you could forward this kind of failure to Bitbar. Maybe there is an easy fix, or a workaround, e.g. retrying the port forwarding setup multiple times (I haven't tried that yet). Thanks!
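The retry workaround could be sketched generically like this; `forward_port` in the usage comment is a hypothetical stand-in for the actual adb forward setup:

```python
import time


def retry(action, attempts=3, delay=0.5):
    """Call `action` until it succeeds, retrying a few times on failure.

    Sketch of the proposed workaround: wrap the adb port-forward setup so
    that a transient "Address already in use" does not fail the session.
    """
    last_error = None
    for i in range(attempts):
        try:
            return action()
        except OSError as e:
            last_error = e
            time.sleep(delay * (2 ** i))  # simple exponential backoff
    raise last_error


# Usage sketch (forward_port is hypothetical):
#   retry(lambda: forward_port(local=42092, remote=2829))
```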

Duplicate of this bug: 1916251

Andrew, could you please help us in forwarding this issue to Bitbar? Thanks in advance.

Flags: needinfo?(aerickson)

Yeah, will do.

I'm not sure if retrying will help, but it could.

Flags: needinfo?(aerickson)

Hi Andrew, I wanted to check if you had the time to inform Bitbar about this issue. It's still happening quite frequently to us. Thanks.

Flags: needinfo?(aerickson)

These failures are not on Bitbar devices (they would have an android-hw prefix).

android-em tasks are not run on Bitbar, but on GCP hosts (gecko-t/t-linux-kvm-noscratch-gcp is the worker pool) running our very old docker-worker images that run a docker container. It looks like they're using our older 18.04 Docker image.

I don't know why adb port forwarding would fail inside these... they are on an old version of docker and TC docker-worker engine (that we're trying to replace).

You could try migrating the tasks to a newer docker image or to a newer d2g-based worker-pool that would have a newer Docker version (releng/jcristau would be the person to talk to if you'd like to try d2g).

Flags: needinfo?(aerickson)

(In reply to Andrew Erickson [:aerickson] from comment #139)

These failures are not on Bitbar devices (they would have an android-hw prefix).

Oh! So we have been chasing a red herring for the last few weeks. That is good to know. Thanks!

android-em tasks are not run on Bitbar, but on GCP hosts (gecko-t/t-linux-kvm-noscratch-gcp is the worker pool) running our very old docker-worker images that run a docker container. It looks like they're using our older 18.04 Docker image.

I don't know why adb port forwarding would fail inside these... they are on an old version of docker and TC docker-worker engine (that we're trying to replace).

You could try migrating the tasks to a newer docker image or to a newer d2g-based worker-pool that would have a newer Docker version (releng/jcristau would be the person to talk to if you'd like to try d2g).

I'm not sure what could be done or what will be chosen for the Android wpt jobs (specifically wdspec) in the future. Joel, are there any plans to upgrade the worker for these jobs? If so, what is the proposed worker pool? I also don't know what d2g actually stands for. Thanks!

Flags: needinfo?(jmaher)

There are no plans to upgrade our android emulators. We have a lot of work scheduled for macOS and Linux that will consume all of H1 2025. I would like to see the android emulators upgraded at some point: newer docker image, newer emulator software (if realistic), and a newer version of Android. That is something for H2 if we are lucky; there are a lot of unknowns there. It might be easy, but who knows.

d2g is some taskcluster term. I am not sure exactly what it is, but I believe it is something related to docker to generic [worker]. I don't understand why that is needed and why we don't just run generic-worker on everything. I am abstracted from taskcluster and only familiar with the configurations and scripts that are in mozilla-central.

Flags: needinfo?(jmaher)

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #141)

d2g is some taskcluster term. I am not sure exactly what it is, but I believe it is something related to docker to generic [worker].

Sorry, yeah that's correct. The TC team has deprecated the docker-worker (d-w) and all Docker-based workloads are moving to pools that run the newer generic-worker (g-w) with d2g support. d2g (support) means it can handle the older d-w task payloads.

I don't understand why that is needed and why we don't just run generic-worker on everything. I am abstracted from taskcluster and only familiar with the configurations and scripts that are in mozilla-central.

We have g-w pools that run Dockerized and non-dockerized workloads. We definitely don't need to run in containers; it just allows the task author to run in a customized and controlled environment (OS, packages, etc. can be specified) vs the relatively bare OS that RelSRE provides. Docker adds complexity that might not be needed or desired.

So would those other workers be more reliable with regard to this failure? If not, we should just keep what we currently have and maybe investigate further, if possible, by using a loaner.

Yes, I think both methods could be potential solutions for the issue. If this error rate bothers you, I think it's definitely worth some investigation (and moving the jobs to a newer OS sounds like a good idea).

  1. If you'd like to stay with a Docker image, I'd chat with :jcristau on releng. He's working on migrating tasks to the new d2g workers (that have a newer Docker backend that might fix the issue). We're still working out the bugs with g-w d2g functionality and the worker image, so that's an additional risk with this option.

  2. Trying a bare OS worker would allow you to rule out Docker as a potential source of the issue. This option has a bit more work up front, but the infrastructure is up and working now and it frees you from Docker (good and bad, mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1791216#c142).

(In reply to Andrew Erickson [:aerickson] from comment #144)

Yes, I think both methods could be potential solutions for the issue. If this error rate bothers you, I think it's definitely worth some investigation (and moving the jobs to a newer OS sounds like a good idea).

I do not know the plans for how tests will be run on Android in the future. What I can say is that we want to run the tests similarly to all the other ones. Joel, can you give advice on what to do?

Flags: needinfo?(jmaher)

These failures are more specifically on android emulators (we don't run wdspec tests on hardware as it isn't fully required).

There are a few variables:

  1. base OS that runs the emulator (Ubuntu 18.04)
  2. emulator (linux emulator) (main script)
  3. android version - currently 7.0! (avd repack) - (main script and config file) - oddly, this does most of what the script in #2 does. It also seems to depend on xvfb; not sure if X is required, or maybe some kind of graphics device or video/audio in/out.
  4. android system (task) - (main script and config file - same as the config file in step #3)

I am not sure if there are other variables to consider. In general, we should upgrade everything that we can. We probably don't have a good 24.04 image to use, so that would need to wait. I am not sure whether updating the emulator is as easy as updating the config file.

Either way, this will probably take a week of hacking to consider upgrading the emulator and toolchain, which might depend on newer kernel features than we have in Ubuntu 18.04. I know there is a mostly ready 24.04 X11 image; it had a few quirks, and that is a Q2 thing for me. If that works out, upgrading this would be Q3 at the earliest.

I have never played with the emulator or toolchain; :aerickson is the only one I know of who has done some of that, although not a lot. For Q1, I am pretty sure he is already overscheduled with other external deadlines due to contract renewals, etc.

Flags: needinfo?(jmaher)

It looks like the situation got much better over the last couple of months. I wonder whether some changes have been made to the Android infrastructure that could have helped reduce the number of failures.

Most likely this is the same underlying issue as bug 1981210, which was just fixed specifically for Android 14. Given that there are no more test failures reported here, let's mark it as a duplicate.

Status: NEW → RESOLVED
Closed: 4 months ago
Duplicate of bug: 1981210
Resolution: --- → DUPLICATE

Actually, bug 1981210 has a fix for geckodriver, so let's mark this bug as dependent and incomplete for now, so that we can reopen it once failures come in again.

Depends on: 1981210
No longer duplicate of bug: 1981210
Resolution: DUPLICATE → INCOMPLETE
See Also: 1981210