Closed Bug 1400819 Opened 4 years ago Closed 3 years ago

Race in Marionette client when attempting to initially connect to the socket

Categories

(Testing :: Marionette, defect, P3)

57 Branch
defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: whimboo, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: pi-marionette-client)

Not sure yet what could cause this problem, but the patch on bug 1394381 might help which adds logging to Marionette driver.
Here an example:

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=130322017&lineNumber=36972

[task 2017-09-12T10:35:19.566584Z] 10:35:19     INFO -  1505212519561	Marionette	INFO	Enabled via MOZ_MARIONETTE
[task 2017-09-12T10:35:19.566705Z] 10:35:19     INFO -  1505212519561	Marionette	DEBUG	Received observer notification "profile-after-change"
[task 2017-09-12T10:35:19.655130Z] 10:35:19     INFO -  1505212519648	Marionette	DEBUG	Received observer notification "command-line-startup"
[task 2017-09-12T10:35:19.711533Z] 10:35:19     INFO -  1505212519705	Marionette	DEBUG	Received observer notification "domwindowopened"
[task 2017-09-12T10:35:19.788149Z] 10:35:19     INFO -  1505212519776	Marionette	DEBUG	Safe Mode detected. Going to suspress the dialog now.
[task 2017-09-12T10:35:20.479621Z] 10:35:20     INFO -  GLib-GIO-Message: Using the 'memory' GSettings backend.  Your settings will not be saved or shared with other applications.
[task 2017-09-12T10:35:20.709397Z] 10:35:20     INFO -  1505212520705	Marionette	DEBUG	Received observer notification "sessionstore-windows-restored"
[task 2017-09-12T10:35:21.761607Z] 10:35:21     INFO -  JavaScript error: jar:file:///builds/worker/workspace/build/application/firefox/omni.ja!/components/nsUrlClassifierListManager.js, line 69: NS_ERROR_XPC_GS_RETURNED_FAILURE: Component returned failure code: 0x80570016 (NS_ERROR_XPC_GS_RETURNED_FAILURE) [nsIJSCID.getService]
[task 2017-09-12T10:35:21.762774Z] 10:35:21     INFO -  JavaScript error: jar:file:///builds/worker/workspace/build/application/firefox/omni.ja!/components/nsUrlClassifierListManager.js, line 69: NS_ERROR_XPC_GS_RETURNED_FAILURE: Component returned failure code: 0x80570016 (NS_ERROR_XPC_GS_RETURNED_FAILURE) [nsIJSCID.getService]
[task 2017-09-12T10:35:21.938982Z] 10:35:21     INFO -  1505212521935	Marionette	INFO	Listening on port 2828
[task 2017-09-12T10:37:19.512483Z] 10:37:19     INFO - TEST-UNEXPECTED-ERROR | test_quit_restart.py TestQuitRestart.test_in_app_restart_safe_mode | IOError: Process has been unexpectedly closed (Exit code: -15) (Reason: [Errno 111] Connection refused)

This is a case for the safe browsing mode, where we clearly have a running Marionette server, but no connection was established.
This is still an issue which triggers a lot of intermittents as it looks like. Bumping priority to P3.
Priority: -- → P3
My upcoming patch for bug 1410366 should fix that.
Depends on: 1410366
As noticed on bug 1410366 there are cases when the client of a just connected socket doesn't receive the hello string. I don't know why that happens, and cannot reproduce it, except for Fennec (Android emulator) which doesn't offer us a great way to actually log content. :/
Depends on: 1413852
Additional to comment 4 we currently wait 360s for the hello string to appear! This is way to much and happens due to a bug in transport.py. I will take care of it via bug 1413852.
(In reply to Henrik Skupin (:whimboo) from bug 1397201 comment #82)
> The timeouts have been changed. But lets keep the work happening on this bug
> and not bug 1414495.
> 
> (In reply to Geoff Brown [:gbrown] from comment #1)
> > I added some logging in
> > https://treeherder.mozilla.org/#/
> > jobs?repo=try&revision=6737b2be84e9b7cd23fcb1dcf4586abab865498a and
> > reproduced the timeouts. In the timeout cases, it looks to me like
> > start_session() completes very quickly -- maybe even before the server has
> > started?? I wonder if it is actually connecting to something (what?) or if
> > it is just failing but not noticing that it's failing.
> > 
> > I then added more logging and pushed to try, but then couldn't reproduce any
> > time outs on startup -- just as jmaher reported in
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1397201#c79.
> 
> 18:46:18     INFO -  creating Marionette...
> 18:46:18     INFO -  starting session...
> 18:46:18     INFO -  2017-11-03 18:46:18.668000 start_session: starts
> 18:46:18     INFO -  2017-11-03 18:46:18.673000 start_session: after
> raise_for_port
> 18:46:18     INFO -  2017-11-03 18:46:18.675000 start_session: after connect
> 18:46:18     INFO -  2017-11-03 18:46:18.675000 start_session: calls
> send_message...
> 18:46:18     INFO -  2017-11-03 18:46:18.698000 start_session: after
> send_message
> 18:46:18     INFO -  started session
> [..]
> 18:46:21     INFO -  GECKO(1896) | 1509734781926	Marionette	DEBUG	Received
> observer notification "sessionstore-windows-restored"
> 18:46:22     INFO -  GECKO(1896) | 1509734782546	Marionette	DEBUG	New
> connections are accepted
> 18:46:22     INFO -  GECKO(1896) | 1509734782546	Marionette	INFO	Listening
> on port 2828
> 
> Basically Marionette client should not be able to connect to the server
> socket unless it is actively listening on port 2828, which in this case
> happened 4s after the connection attempt. So why did that pass? It could
> only mean that there was another Firefox instance running occupying this
> port. But it must have been released quickly, because otherwise the socket
> server of the wanted Firefox instance would have been failed.
Assignee: nobody → hskupin
Blocks: 1397201
Status: NEW → ASSIGNED
Priority: P3 → P1
Summary: Sometimes Marionette driver seems to silently stop waiting for a port → Race in Marionette client when attempting to initially connect to the socket.
I started a try build with a lot of logging enabled. Lets see if we can catch such a hang now.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=fe052ba1922a2a20b9607472b4c68e37bf390c18
I missed to add a locally introduced file and as such the last try is busted. Here a new try build:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=29b9eeb4fba4e6d955fa3e86792c95c2220475b7
I finally hit such a failure on try but not for mochitests or reftests, but for Marionette headless:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=29b9eeb4fba4e6d955fa3e86792c95c2220475b7&selectedJob=142699963

The important lines from the log are:

https://treeherder.mozilla.org/logviewer.html#?job_id=142699963&repo=try&lineNumber=896-913

12:20:14     INFO -  Application command: Z:\task_1510055025\build\application\firefox\firefox.exe -no-remote -marionette -profile c:\users\genericworker\appdata\local\temp\tmpsmwsdf.mozrunner
12:20:14     INFO -  marionette.driver Connected to Marionette on localhost:2828
12:20:14     INFO -  Profile path is c:\users\genericworker\appdata\local\temp\tmpsmwsdf.mozrunner
12:20:14     INFO -  Starting fixture servers
12:20:14     INFO -  *** receive: '50:{"appli'
12:20:14     INFO -  ** receive: reading '{"appli' with '50' bytes
12:20:14     INFO -  *** receive: 'cationType":"gecko","marionetteProtocol":3}'
12:20:14     INFO -  ** receive: reading '{"applicationType":"gecko","marionetteProtocol":3}' with '50' bytes
12:20:14     INFO -  marionette.driver -> 24:[0, 1, "newSession", {}]
12:20:14     INFO -  ** send: sent '24:[0, 1, "newSession", {}]' with 27 bytes
12:20:14     INFO -  *** You are running in headless mode.
12:20:14     INFO -  [7588, Main Thread] WARNING: Failed to load startupcache file correctly, removing!: file z:/build/build/src/startupcache/StartupCache.cpp, line 217
12:20:15     INFO -  1510057215404	Marionette	DEBUG	Received observer notification "profile-after-change"
12:20:15     INFO -  [7588, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file z:/build/build/src/extensions/cookie/nsPermissionManager.cpp, line 2871
12:20:15     INFO -  ++DOCSHELL 0000022154C32000 == 1 [pid = 7588] [id = {3ecdd5a4-0a15-4623-a8b7-bb1171b98281}]
12:20:15     INFO -  ++DOMWINDOW == 1 (000002215FDD4800) [pid = 7588] [serial = 1] [outer = 0000000000000000]
12:20:15     INFO -  ++DOMWINDOW == 2 (000002215FDD5800) [pid = 7588] [serial = 2] [outer = 000002215FDD4800]
12:20:15     INFO -  1510057215522	Marionette	DEBUG	Received observer notification "command-line-startup"
12:20:15     INFO -  1510057215522	Marionette	INFO	Enabled via --marionette
[..]
12:20:18     INFO -  1510057218239	Marionette	DEBUG	New connections are accepted
12:20:18     INFO -  1510057218239	Marionette	INFO	Listening on port 2828

So right after Marionette client started Firefox it can connect to an instance of Firefox. But what kind of instance is that? Definitely not the one Marionette just started, because it takes 4 more seconds until the server socket is ready, and would accept any connection.

As result the client sends the new session command to that unknown instance of Firefox, but then never got a reply. What I wonder is why `_send_message` doesn't return here. As it looks like the newly spawned instance of Firefox can start the socket server at port 2828, so the other instance has released it. Marionette client's socket should get closed, and it should not hang.
By checking the history of the specific worker this task was running on I found that a previous test job did not correctly clean-up the worker. This indeed could have been caused this problem.

I will see if I can make this reproducible locally. At least I have a starting point now.
No longer blocks: 1397612
Not being actively worked on and not part of this quarter’s goals.
Priority: P1 → P3
Summary: Race in Marionette client when attempting to initially connect to the socket. → Race in Marionette client when attempting to initially connect to the socket
Just checked the first comment again and noticed that it came from a test which restarted Firefox. So there is a chance that we accidentally connected to the old process before it was gone. Lets wait for bug 1433873 first.
Assignee: hskupin → nobody
Status: ASSIGNED → NEW
Depends on: 1433873
Depends on: 1449862
The affected test was actually disabled via bug 1397612. We should re-enable it first to see if it may work now.
Depends on: 1397612
(In reply to Henrik Skupin (:whimboo) from comment #13)
> The affected test was actually disabled via bug 1397612. We should re-enable
> it first to see if it may work now.

This test was re-enabled now. Lets see if we see this again, or if it is finally gone. I will re-check next week or so.
Flags: needinfo?(hskupin)
So far we haven't experienced the problem again. I would call it WFM for now by most likely bug 1433873.
Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(hskupin)
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.