TcpTransport:receive runs into a timeout of 360s even if the socket was shutdown immediately
Categories
(Testing :: Marionette Client and Harness, defect)
Tracking
(firefox-esr115 fixed, firefox116 fixed, firefox117 fixed)
People
(Reporter: whimboo, Assigned: whimboo)
References
Details
Attachments
(1 file)
This is an issue that has been seen on bug 1828902. See more details in comments 9 - 11. I'll quote what Valentin said last on that given bug:
(In reply to Valentin Gosu [:valentin] (he/him) {{ PTO 3 July - 6 August }} from comment #11)
I was actually able to reproduce it locally on my win10 machine and confirmed that the test was still hanging after the firefox process had ended.
I think it may be some odd windows quirk that a socket to a recently open port does not return connection refused - or it may be some actual necko bug. I think adding a shorter socket timeout is definitely a big win in terms of CI efficiency.
If you can summarize how GetCurrentURL works in marionette, I can look into whether there's some underlying necko issue here (feel free to file a separate bug for that)
The problem here is most likely not tight to this particular command in Marionette, but can happen to each and every command. It's just that this particular content process crash test is using this command to check that it gets an invalid browsing context reference (window reference).
The server socket in Marionette is wrapped here:
https://searchfox.org/mozilla-central/source/remote/marionette/server.sys.mjs
The onPacket
handler is then responsible to handle incoming messages and to forward valid commands to the Driver instance (driver.sys.mjs).
In case of WebDriver:GetCurrentURL
we only return the value of the currently selected browsing context which is browsingContext.currentURI.spec
. And that happens in the parent process.
The hang that happens here is as long as the socket timeout that is 360s by default for the Marionette client. Over on bug 1828902 I'm going to reduce the timeout to just 10s, which is totally enough for this particular use case. But it would be good to know why in some cases - so far only on Windows - this hang happens.
Valentin, please let me know if you need more information. I'm happy to provide all the missing details. Also please feel free to move the bug to the right component in case that's the wrong one. Thanks!
Assignee | ||
Updated•2 years ago
|
Comment 1•2 years ago
|
||
So, I confirmed that the gecko server sockets work properly.
Creating a server socket then crashing the process works as expected, and the client side of the TCP connection gets closed immediately.
I then did a wireshark trace of localhost port 2828 while running the marionette test previously to bug 1828902 landing, and confirmed that the RST flag is set on the TCP connection, meaning that the connection does get killed.
That meant the issue must be with the marionette TCP client.
This led me to this code:
https://searchfox.org/mozilla-central/rev/3b707c8fd7e978eebf24279ee51ccf07895cfbcb/testing/marionette/client/marionette_driver/transport.py#184-185,216-225
def receive(self, unmarshal=True):
"""Wait for the next complete response from the remote.
...
while recv_bytes > 0:
if timeout_time is not None and time.time() > timeout_time:
raise socket.timeout(
"Connection timed out after {}s".format(self.socket_timeout)
)
try:
chunk = sock.recv(recv_bytes)
except OSError:
continue
That means when sock.recv fails due to the server shutting down, we loop forever in this while (until a timeout).
Removing the continue made the test pass immediately. Not sure if doing that would be breaking something else though.
I'm not sure what the reasons behind this decision continue
when recv
fails were, but I think the timeout fix in bug 1828902 is just patching over the root cause.
Assignee | ||
Comment 2•2 years ago
|
||
Oh that is a great find! Not sure why we haven't seen that ourselves. That behavior is indeed problematic and could be the reason we see quite a lot of timeouts during shutdown/restart tests. I'll make this modification and run all the unit tests to see if it makes a difference. It basically shouldn't but who knows...
It's still interesting that this issue is only present on Windows in TaskCluster but not locally for me. Maybe some specific Windows version or update is necessary to be installed.
Here is a try build:
https://treeherder.mozilla.org/jobs?repo=try&revision=23b8c56306759fa210d4379680c4f9be137a5b74
Assignee | ||
Comment 3•2 years ago
|
||
Updated•2 years ago
|
Comment 5•2 years ago
|
||
bugherder |
Updated•2 years ago
|
Comment 7•2 years ago
|
||
The patch landed in nightly and beta is affected.
:whimboo, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval.
- If no, please set
status-firefox116
towontfix
.
For more information, please visit BugBot documentation.
Updated•2 years ago
|
Updated•2 years ago
|
Assignee | ||
Updated•2 years ago
|
Description
•