Closed Bug 1840701 Opened 2 years ago Closed 2 years ago

TcpTransport:receive runs into a timeout of 360s even if the socket was shutdown immediately

Categories

(Testing :: Marionette Client and Harness, defect)

defect

Tracking

(firefox-esr115 fixed, firefox116 fixed, firefox117 fixed)

RESOLVED FIXED
117 Branch
Tracking Status
firefox-esr115 --- fixed
firefox116 --- fixed
firefox117 --- fixed

People

(Reporter: whimboo, Assigned: whimboo)

References

Details

Attachments

(1 file)

This is an issue that has been seen on bug 1828902. See more details in comments 9 - 11. I'll quote what Valentin said last on that given bug:

(In reply to Valentin Gosu [:valentin] (he/him) {{ PTO 3 July - 6 August }} from comment #11)

I was actually able to reproduce it locally on my win10 machine and confirmed that the test was still hanging after the firefox process had ended.
I think it may be some odd windows quirk that a socket to a recently open port does not return connection refused - or it may be some actual necko bug. I think adding a shorter socket timeout is definitely a big win in terms of CI efficiency.
If you can summarize how GetCurrentURL works in marionette, I can look into whether there's some underlying necko issue here (feel free to file a separate bug for that)

The problem here is most likely not tight to this particular command in Marionette, but can happen to each and every command. It's just that this particular content process crash test is using this command to check that it gets an invalid browsing context reference (window reference).

The server socket in Marionette is wrapped here:
https://searchfox.org/mozilla-central/source/remote/marionette/server.sys.mjs

The onPacket handler is then responsible to handle incoming messages and to forward valid commands to the Driver instance (driver.sys.mjs).

In case of WebDriver:GetCurrentURL we only return the value of the currently selected browsing context which is browsingContext.currentURI.spec. And that happens in the parent process.

The hang that happens here is as long as the socket timeout that is 360s by default for the Marionette client. Over on bug 1828902 I'm going to reduce the timeout to just 10s, which is totally enough for this particular use case. But it would be good to know why in some cases - so far only on Windows - this hang happens.

Valentin, please let me know if you need more information. I'm happy to provide all the missing details. Also please feel free to move the bug to the right component in case that's the wrong one. Thanks!

Flags: needinfo?(valentin.gosu)

So, I confirmed that the gecko server sockets work properly.
Creating a server socket then crashing the process works as expected, and the client side of the TCP connection gets closed immediately.

I then did a wireshark trace of localhost port 2828 while running the marionette test previously to bug 1828902 landing, and confirmed that the RST flag is set on the TCP connection, meaning that the connection does get killed.
That meant the issue must be with the marionette TCP client.

This led me to this code:
https://searchfox.org/mozilla-central/rev/3b707c8fd7e978eebf24279ee51ccf07895cfbcb/testing/marionette/client/marionette_driver/transport.py#184-185,216-225

def receive(self, unmarshal=True):
    """Wait for the next complete response from the remote.
...
        while recv_bytes > 0:
            if timeout_time is not None and time.time() > timeout_time:
                raise socket.timeout(
                    "Connection timed out after {}s".format(self.socket_timeout)
                )

            try:
                chunk = sock.recv(recv_bytes)
            except OSError:
                continue

That means when sock.recv fails due to the server shutting down, we loop forever in this while (until a timeout).
Removing the continue made the test pass immediately. Not sure if doing that would be breaking something else though.

I'm not sure what the reasons behind this decision continue when recv fails were, but I think the timeout fix in bug 1828902 is just patching over the root cause.

Component: Networking → Marionette Client and Harness
Flags: needinfo?(valentin.gosu) → needinfo?(hskupin)
Product: Core → Testing

Oh that is a great find! Not sure why we haven't seen that ourselves. That behavior is indeed problematic and could be the reason we see quite a lot of timeouts during shutdown/restart tests. I'll make this modification and run all the unit tests to see if it makes a difference. It basically shouldn't but who knows...

It's still interesting that this issue is only present on Windows in TaskCluster but not locally for me. Maybe some specific Windows version or update is necessary to be installed.

Here is a try build:
https://treeherder.mozilla.org/jobs?repo=try&revision=23b8c56306759fa210d4379680c4f9be137a5b74

Flags: needinfo?(hskupin)
OS: Windows → All
Summary: On Windows the nsIServerSocket seems to hold resources so that clients do not see a "connection refused" → TcpTransport:receive runs into a timeout of 360s even if the socket was shutdown immediately
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Blocks: 1828902
Pushed by hskupin@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/c92a7ee14127 [marionette] TcpTransport:receive needs to return immediately when socket was closed. r=webdriver-reviewers,jgraham
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 117 Branch
Duplicate of this bug: 1828902

The patch landed in nightly and beta is affected.
:whimboo, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox116 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(hskupin)
Flags: needinfo?(hskupin)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: