Closed
Bug 1400819
Opened 7 years ago
Closed 6 years ago
Race in Marionette client when attempting to initially connect to the socket
Categories
(Remote Protocol :: Marionette, defect, P3)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: whimboo, Unassigned)
References
Details
(Keywords: pi-marionette-client)
Not sure yet what could cause this problem, but the patch on bug 1394381 might help which adds logging to Marionette driver.
Reporter | ||
Comment 1•7 years ago
|
||
Here an example:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=130322017&lineNumber=36972
[task 2017-09-12T10:35:19.566584Z] 10:35:19 INFO - 1505212519561 Marionette INFO Enabled via MOZ_MARIONETTE
[task 2017-09-12T10:35:19.566705Z] 10:35:19 INFO - 1505212519561 Marionette DEBUG Received observer notification "profile-after-change"
[task 2017-09-12T10:35:19.655130Z] 10:35:19 INFO - 1505212519648 Marionette DEBUG Received observer notification "command-line-startup"
[task 2017-09-12T10:35:19.711533Z] 10:35:19 INFO - 1505212519705 Marionette DEBUG Received observer notification "domwindowopened"
[task 2017-09-12T10:35:19.788149Z] 10:35:19 INFO - 1505212519776 Marionette DEBUG Safe Mode detected. Going to suspress the dialog now.
[task 2017-09-12T10:35:20.479621Z] 10:35:20 INFO - GLib-GIO-Message: Using the 'memory' GSettings backend. Your settings will not be saved or shared with other applications.
[task 2017-09-12T10:35:20.709397Z] 10:35:20 INFO - 1505212520705 Marionette DEBUG Received observer notification "sessionstore-windows-restored"
[task 2017-09-12T10:35:21.761607Z] 10:35:21 INFO - JavaScript error: jar:file:///builds/worker/workspace/build/application/firefox/omni.ja!/components/nsUrlClassifierListManager.js, line 69: NS_ERROR_XPC_GS_RETURNED_FAILURE: Component returned failure code: 0x80570016 (NS_ERROR_XPC_GS_RETURNED_FAILURE) [nsIJSCID.getService]
[task 2017-09-12T10:35:21.762774Z] 10:35:21 INFO - JavaScript error: jar:file:///builds/worker/workspace/build/application/firefox/omni.ja!/components/nsUrlClassifierListManager.js, line 69: NS_ERROR_XPC_GS_RETURNED_FAILURE: Component returned failure code: 0x80570016 (NS_ERROR_XPC_GS_RETURNED_FAILURE) [nsIJSCID.getService]
[task 2017-09-12T10:35:21.938982Z] 10:35:21 INFO - 1505212521935 Marionette INFO Listening on port 2828
[task 2017-09-12T10:37:19.512483Z] 10:37:19 INFO - TEST-UNEXPECTED-ERROR | test_quit_restart.py TestQuitRestart.test_in_app_restart_safe_mode | IOError: Process has been unexpectedly closed (Exit code: -15) (Reason: [Errno 111] Connection refused)
This is a case for the safe browsing mode, where we clearly have a running Marionette server, but no connection was established.
Reporter | ||
Comment 2•7 years ago
|
||
This is still an issue which triggers a lot of intermittents as it looks like. Bumping priority to P3.
Keywords: ateam-marionette-client
Priority: -- → P3
Reporter | ||
Comment 3•7 years ago
|
||
My upcoming patch for bug 1410366 should fix that.
Depends on: 1410366
Reporter | ||
Comment 4•7 years ago
|
||
As noticed on bug 1410366 there are cases when the client of a just connected socket doesn't receive the hello string. I don't know why that happens, and cannot reproduce it, except for Fennec (Android emulator) which doesn't offer us a great way to actually log content. :/
Reporter | ||
Comment 5•7 years ago
|
||
Additional to comment 4 we currently wait 360s for the hello string to appear! This is way to much and happens due to a bug in transport.py. I will take care of it via bug 1413852.
Reporter | ||
Comment 6•7 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from bug 1397201 comment #82)
> The timeouts have been changed. But lets keep the work happening on this bug
> and not bug 1414495.
>
> (In reply to Geoff Brown [:gbrown] from comment #1)
> > I added some logging in
> > https://treeherder.mozilla.org/#/
> > jobs?repo=try&revision=6737b2be84e9b7cd23fcb1dcf4586abab865498a and
> > reproduced the timeouts. In the timeout cases, it looks to me like
> > start_session() completes very quickly -- maybe even before the server has
> > started?? I wonder if it is actually connecting to something (what?) or if
> > it is just failing but not noticing that it's failing.
> >
> > I then added more logging and pushed to try, but then couldn't reproduce any
> > time outs on startup -- just as jmaher reported in
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1397201#c79.
>
> 18:46:18 INFO - creating Marionette...
> 18:46:18 INFO - starting session...
> 18:46:18 INFO - 2017-11-03 18:46:18.668000 start_session: starts
> 18:46:18 INFO - 2017-11-03 18:46:18.673000 start_session: after
> raise_for_port
> 18:46:18 INFO - 2017-11-03 18:46:18.675000 start_session: after connect
> 18:46:18 INFO - 2017-11-03 18:46:18.675000 start_session: calls
> send_message...
> 18:46:18 INFO - 2017-11-03 18:46:18.698000 start_session: after
> send_message
> 18:46:18 INFO - started session
> [..]
> 18:46:21 INFO - GECKO(1896) | 1509734781926 Marionette DEBUG Received
> observer notification "sessionstore-windows-restored"
> 18:46:22 INFO - GECKO(1896) | 1509734782546 Marionette DEBUG New
> connections are accepted
> 18:46:22 INFO - GECKO(1896) | 1509734782546 Marionette INFO Listening
> on port 2828
>
> Basically Marionette client should not be able to connect to the server
> socket unless it is actively listening on port 2828, which in this case
> happened 4s after the connection attempt. So why did that pass? It could
> only mean that there was another Firefox instance running occupying this
> port. But it must have been released quickly, because otherwise the socket
> server of the wanted Firefox instance would have been failed.
Assignee: nobody → hskupin
Blocks: 1397201
Status: NEW → ASSIGNED
Priority: P3 → P1
Summary: Sometimes Marionette driver seems to silently stop waiting for a port → Race in Marionette client when attempting to initially connect to the socket.
Reporter | ||
Comment 7•7 years ago
|
||
I started a try build with a lot of logging enabled. Lets see if we can catch such a hang now.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=fe052ba1922a2a20b9607472b4c68e37bf390c18
Reporter | ||
Comment 8•7 years ago
|
||
I missed to add a locally introduced file and as such the last try is busted. Here a new try build:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=29b9eeb4fba4e6d955fa3e86792c95c2220475b7
Reporter | ||
Comment 9•7 years ago
|
||
I finally hit such a failure on try but not for mochitests or reftests, but for Marionette headless:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=29b9eeb4fba4e6d955fa3e86792c95c2220475b7&selectedJob=142699963
The important lines from the log are:
https://treeherder.mozilla.org/logviewer.html#?job_id=142699963&repo=try&lineNumber=896-913
12:20:14 INFO - Application command: Z:\task_1510055025\build\application\firefox\firefox.exe -no-remote -marionette -profile c:\users\genericworker\appdata\local\temp\tmpsmwsdf.mozrunner
12:20:14 INFO - marionette.driver Connected to Marionette on localhost:2828
12:20:14 INFO - Profile path is c:\users\genericworker\appdata\local\temp\tmpsmwsdf.mozrunner
12:20:14 INFO - Starting fixture servers
12:20:14 INFO - *** receive: '50:{"appli'
12:20:14 INFO - ** receive: reading '{"appli' with '50' bytes
12:20:14 INFO - *** receive: 'cationType":"gecko","marionetteProtocol":3}'
12:20:14 INFO - ** receive: reading '{"applicationType":"gecko","marionetteProtocol":3}' with '50' bytes
12:20:14 INFO - marionette.driver -> 24:[0, 1, "newSession", {}]
12:20:14 INFO - ** send: sent '24:[0, 1, "newSession", {}]' with 27 bytes
12:20:14 INFO - *** You are running in headless mode.
12:20:14 INFO - [7588, Main Thread] WARNING: Failed to load startupcache file correctly, removing!: file z:/build/build/src/startupcache/StartupCache.cpp, line 217
12:20:15 INFO - 1510057215404 Marionette DEBUG Received observer notification "profile-after-change"
12:20:15 INFO - [7588, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file z:/build/build/src/extensions/cookie/nsPermissionManager.cpp, line 2871
12:20:15 INFO - ++DOCSHELL 0000022154C32000 == 1 [pid = 7588] [id = {3ecdd5a4-0a15-4623-a8b7-bb1171b98281}]
12:20:15 INFO - ++DOMWINDOW == 1 (000002215FDD4800) [pid = 7588] [serial = 1] [outer = 0000000000000000]
12:20:15 INFO - ++DOMWINDOW == 2 (000002215FDD5800) [pid = 7588] [serial = 2] [outer = 000002215FDD4800]
12:20:15 INFO - 1510057215522 Marionette DEBUG Received observer notification "command-line-startup"
12:20:15 INFO - 1510057215522 Marionette INFO Enabled via --marionette
[..]
12:20:18 INFO - 1510057218239 Marionette DEBUG New connections are accepted
12:20:18 INFO - 1510057218239 Marionette INFO Listening on port 2828
So right after Marionette client started Firefox it can connect to an instance of Firefox. But what kind of instance is that? Definitely not the one Marionette just started, because it takes 4 more seconds until the server socket is ready, and would accept any connection.
As result the client sends the new session command to that unknown instance of Firefox, but then never got a reply. What I wonder is why `_send_message` doesn't return here. As it looks like the newly spawned instance of Firefox can start the socket server at port 2828, so the other instance has released it. Marionette client's socket should get closed, and it should not hang.
Reporter | ||
Comment 10•7 years ago
|
||
By checking the history of the specific worker this task was running on I found that a previous test job did not correctly clean-up the worker. This indeed could have been caused this problem.
I will see if I can make this reproducible locally. At least I have a starting point now.
Comment 11•7 years ago
|
||
Not being actively worked on and not part of this quarter’s goals.
Priority: P1 → P3
Summary: Race in Marionette client when attempting to initially connect to the socket. → Race in Marionette client when attempting to initially connect to the socket
Reporter | ||
Comment 12•7 years ago
|
||
Just checked the first comment again and noticed that it came from a test which restarted Firefox. So there is a chance that we accidentally connected to the old process before it was gone. Lets wait for bug 1433873 first.
Reporter | ||
Comment 13•6 years ago
|
||
The affected test was actually disabled via bug 1397612. We should re-enable it first to see if it may work now.
Depends on: 1397612
Reporter | ||
Comment 14•6 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #13)
> The affected test was actually disabled via bug 1397612. We should re-enable
> it first to see if it may work now.
This test was re-enabled now. Lets see if we see this again, or if it is finally gone. I will re-check next week or so.
Flags: needinfo?(hskupin)
Reporter | ||
Comment 15•6 years ago
|
||
So far we haven't experienced the problem again. I would call it WFM for now by most likely bug 1433873.
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(hskupin)
Resolution: --- → WORKSFORME
Updated•2 years ago
|
Product: Testing → Remote Protocol
You need to log in
before you can comment on or make changes to this bug.
Description
•