Closed Bug 1840701 Opened 2 years ago Closed 2 years ago

TcpTransport:receive runs into a timeout of 360s even if the socket was shutdown immediately

Tracking

(firefox-esr115 fixed, firefox116 fixed, firefox117 fixed)

Status:

RESOLVED FIXED

Milestone:

117 Branch

Tracking Flags:

Tracking

Status

firefox-esr115

---

fixed

firefox116

---

fixed

firefox117

---

fixed

People

(Reporter: whimboo, Assigned: whimboo)

References

Details

Attachments

(1 file)

Bug 1840701 - [marionette] TcpTransport:receive needs to return immediately when socket was closed. 2 years ago Henrik Skupin [:whimboo][⌚️UTC+2] 48 bytes, text/x-phabricator-request		Details \| Review

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Description

•

2 years ago

This is an issue that has been seen on bug 1828902. See more details in comments 9 - 11. I'll quote what Valentin said last on that given bug:

(In reply to Valentin Gosu [:valentin] (he/him) {{ PTO 3 July - 6 August }} from comment #11)

I was actually able to reproduce it locally on my win10 machine and confirmed that the test was still hanging after the firefox process had ended.
I think it may be some odd windows quirk that a socket to a recently open port does not return connection refused - or it may be some actual necko bug. I think adding a shorter socket timeout is definitely a big win in terms of CI efficiency.
If you can summarize how GetCurrentURL works in marionette, I can look into whether there's some underlying necko issue here (feel free to file a separate bug for that)

The problem here is most likely not tight to this particular command in Marionette, but can happen to each and every command. It's just that this particular content process crash test is using this command to check that it gets an invalid browsing context reference (window reference).

The server socket in Marionette is wrapped here:
https://searchfox.org/mozilla-central/source/remote/marionette/server.sys.mjs

The onPacket handler is then responsible to handle incoming messages and to forward valid commands to the Driver instance (driver.sys.mjs).

In case of WebDriver:GetCurrentURL we only return the value of the currently selected browsing context which is browsingContext.currentURI.spec. And that happens in the parent process.

The hang that happens here is as long as the socket timeout that is 360s by default for the Marionette client. Over on bug 1828902 I'm going to reduce the timeout to just 10s, which is totally enough for this particular use case. But it would be good to know why in some cases - so far only on Windows - this hang happens.

Valentin, please let me know if you need more information. I'm happy to provide all the missing details. Also please feel free to move the bug to the right component in case that's the wrong one. Thanks!

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

2 years ago

Flags: needinfo?(valentin.gosu)

Valentin Gosu [:valentin] (he/him)

Comment 1

•

2 years ago

So, I confirmed that the gecko server sockets work properly.
Creating a server socket then crashing the process works as expected, and the client side of the TCP connection gets closed immediately.

I then did a wireshark trace of localhost port 2828 while running the marionette test previously to bug 1828902 landing, and confirmed that the RST flag is set on the TCP connection, meaning that the connection does get killed.
That meant the issue must be with the marionette TCP client.

This led me to this code:
https://searchfox.org/mozilla-central/rev/3b707c8fd7e978eebf24279ee51ccf07895cfbcb/testing/marionette/client/marionette_driver/transport.py#184-185,216-225

def receive(self, unmarshal=True):
    """Wait for the next complete response from the remote.
...
        while recv_bytes > 0:
            if timeout_time is not None and time.time() > timeout_time:
                raise socket.timeout(
                    "Connection timed out after {}s".format(self.socket_timeout)
                )

            try:
                chunk = sock.recv(recv_bytes)
            except OSError:
                continue

That means when sock.recv fails due to the server shutting down, we loop forever in this while (until a timeout).
Removing the continue made the test pass immediately. Not sure if doing that would be breaking something else though.

I'm not sure what the reasons behind this decision continue when recv fails were, but I think the timeout fix in bug 1828902 is just patching over the root cause.

Component: Networking → Marionette Client and Harness

Flags: needinfo?(valentin.gosu) → needinfo?(hskupin)

Product: Core → Testing

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 2

•

2 years ago

Oh that is a great find! Not sure why we haven't seen that ourselves. That behavior is indeed problematic and could be the reason we see quite a lot of timeouts during shutdown/restart tests. I'll make this modification and run all the unit tests to see if it makes a difference. It basically shouldn't but who knows...

It's still interesting that this issue is only present on Windows in TaskCluster but not locally for me. Maybe some specific Windows version or update is necessary to be installed.

Here is a try build:
https://treeherder.mozilla.org/jobs?repo=try&revision=23b8c56306759fa210d4379680c4f9be137a5b74

Flags: needinfo?(hskupin)

OS: Windows → All

Summary: On Windows the nsIServerSocket seems to hold resources so that clients do not see a "connection refused" → TcpTransport:receive runs into a timeout of 360s even if the socket was shutdown immediately

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 3

•

2 years ago

Attached file Bug 1840701 - [marionette] TcpTransport:receive needs to return immediately when socket was closed. — Details

Phabricator Automation

Updated

•

2 years ago

Assignee: nobody → hskupin

Status: NEW → ASSIGNED

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

2 years ago

Blocks: 1828902

Pulsebot

Comment 4

•

2 years ago

Pushed by hskupin@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/c92a7ee14127 [marionette] TcpTransport:receive needs to return immediately when socket was closed. r=webdriver-reviewers,jgraham

Iulian Moraru

Comment 5

•

2 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/c92a7ee14127

Status: ASSIGNED → RESOLVED

Closed: 2 years ago

status-firefox117: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 117 Branch

Cosmin Sabou [:CosminS]

Updated

•

2 years ago

Duplicate of this bug: 1828902

Ryan VanderMeulen [:RyanVM]

Updated

•

2 years ago

status-firefox116: --- → affected

status-firefox-esr115: --- → affected

BugBot [:suhaib / :marco/ :calixte]

Comment 7

•

2 years ago

The patch landed in nightly and beta is affected.
:whimboo, is this bug important enough to require an uplift?

If yes, please nominate the patch for beta approval.
If no, please set status-firefox116 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(hskupin)

Pulsebot

Comment 8

•

2 years ago

uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/a155fa3223a5

Dianna Smith [:diannaS]

Updated

•

2 years ago

status-firefox116: affected → fixed

Pulsebot

Comment 9

•

2 years ago

uplift

https://hg.mozilla.org/releases/mozilla-esr115/rev/54929bac22f0

Dianna Smith [:diannaS]

Updated

•

2 years ago

status-firefox-esr115: affected → fixed

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

2 years ago

Flags: needinfo?(hskupin)

You need to log in before you can comment on or make changes to this bug.

Bugzilla

TcpTransport:receive runs into a timeout of 360s even if the socket was shutdown immediately

Categories

(Testing :: Marionette Client and Harness, defect)

Tracking

(firefox-esr115 fixed, firefox116 fixed, firefox117 fixed)

People

(Reporter: whimboo, Assigned: whimboo)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Updated

Comment 1

Comment 2

Comment 3

Updated

Updated

Comment 4

Comment 5

Updated

Updated

Comment 7

Comment 8

Updated

Comment 9

Updated

Updated

Attachment

General

Description

File Name

Content Type