Closed Bug 1828902 Opened 2 years ago Closed 1 year ago

On Windows the test "test_crash.py TestCrash.test_crash_content_process" runs about 360s

Categories

(Testing :: Marionette Client and Harness, defect, P2)

Default
defect
Points:
2

Tracking

(firefox-esr115 wontfix, firefox116 fixed)

RESOLVED DUPLICATE of bug 1840701
116 Branch
Tracking Status
firefox-esr115 --- wontfix
firefox116 --- fixed

People

(Reporter: whimboo, Assigned: whimboo)

References

Details

(Whiteboard: [webdriver:m7])

Attachments

(1 file)

I've noticed locally that the test test_crash.py TestCrash.test_crash_content_process runs for about 360s on Windows but only 4s on Linux and MacOS.

https://treeherder.mozilla.org/logviewer?job_id=412932680&repo=mozilla-central&lineNumber=15361-15364

[task 2023-04-18T23:43:26.806Z] 23:43:26     INFO -  TEST-START | testing/marionette/harness/marionette_harness/tests/unit/test_crash.py TestCrash.test_crash_content_process
[task 2023-04-18T23:43:26.806Z] 23:43:26     INFO -  Suppressing GECKO output. To view, add `print(self.new_out.getvalue())` to the end of this test.
[task 2023-04-18T23:49:27.175Z] 23:49:27     INFO -  Application command: Z:\task_168185948858909\build\application\firefox\firefox.exe -no-remote -marionette --wait-for-browser -profile C:\Users\task_168185948858909\AppData\Local\Temp\tmpzfa4vyk4.mozrunner
[task 2023-04-18T23:49:29.088Z] 23:49:29     INFO -  TEST-PASS | testing/marionette/harness/marionette_harness/tests/unit/test_crash.py TestCrash.test_crash_content_process | took 362289ms

It can be fixed by setting the socket timeout to 10s as well for the get_url() call so that there is no hang in sock.recv() anymore. We should investigate possible solutions to make the Mn test job faster.

The severity field is not set for this bug.
:whimboo, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(hskupin)
Points: --- → 2
Flags: needinfo?(hskupin)
Priority: -- → P2
Severity: -- → S3

I'm not able to replicate this long duration of the test locally. As such I enabled the logging output for crash tests and pushed to try:
https://treeherder.mozilla.org/jobs?repo=try&revision=515a50c903f539fd7fe4cb21b6dcbc17aaed25fa

The following log lines show such a delay which actually happens during late shutdown of Firefox:
https://treeherder.mozilla.org/logviewer?job_id=420146472&repo=try&lineNumber=1534-1538

What's interesting is the delay at the very end before the browser shuts down and when we start the new Firefox process:

[task 2023-06-21T14:53:08.096Z] 14:53:08     INFO -  DEBUG: Starting phase web-workers-shutdown
[task 2023-06-21T14:53:08.096Z] 14:53:08     INFO -  DEBUG: Spinning the event loop
[task 2023-06-21T14:53:08.096Z] 14:53:08     INFO -  DEBUG: Finished phase web-workers-shutdown
[task 2023-06-21T14:59:08.004Z] 14:59:08     INFO -  Application command: Z:\task_168735841901231\build\application\firefox\firefox.exe -no-remote -marionette --wait-for-browser -profile C:\Users\task_168735841901231\AppData\Local\Temp\tmpx8f7ppfm.mozrunner
[task 2023-06-21T14:59:08.145Z] 14:59:08     INFO -  DEBUG: Adding blocker AddonManager: shutting down. for phase profile-before-change

I wonder if this might be caused by the Marionette test, harness or client. So I'll add some more print calls to the Python code.

Ok, so the problem here is indeed a hang when sending the Marionette command and the server side is not responding anymore:
https://treeherder.mozilla.org/logviewer?job_id=420277166&repo=try&lineNumber=1615-1714

We are sending WebDriver:GetCurrentURL at 12:43:27 with in the call to Wait.until() but that doesn't return until 12:49:27 when the log prints *** after wait. These are exactly 360s which is the default socket timeout for Marionette.

It's not clear to me why we do not see this problem on other platforms or on my local Windows machine. Maybe the Socket is in some cases not directly closed but remains in a state where it doesn't see incoming data anymore, and as such cannot send any reply?

I'll create a new try build with nsSocketTransport:5 logging in Firefox enabled. Maybe this gives an indication what's wrong.

My suspicion was probably right. The socket gets shutdown but then we end-up hanging:
https://treeherder.mozilla.org/logviewer?job_id=420328660&repo=try&lineNumber=4672-4673

Valentin, what should happen when a client has sent a message over the socket right before the socket is shutting down but no response has been sent out? Shouldn't the socket connection drop and the client gets a disconnected error? Maybe that's not working correctly in this case? I would appreciate if you could have a look at the log and if there is something visible. Thanks!

Flags: needinfo?(valentin.gosu)

For now I pushed a new try build which reduces the socket timeout to 10s when calling a Marionette command while we know that the content process can crash every moment. Lets see if that works.

Assignee: nobody → hskupin
Status: NEW → ASSIGNED

With the attached patch we can reduce the default socket timeout from 360s to just 10s which means that we save around 6 minutes per test job. Per central merge and on release branches we run 8 jobs per push, which frees up roughly 47 minutes of machine time.

Whiteboard: [webdriver:m7]

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #5)

Valentin, what should happen when a client has sent a message over the socket right before the socket is shutting down but no response has been sent out? Shouldn't the socket connection drop and the client gets a disconnected error? Maybe that's not working correctly in this case? I would appreciate if you could have a look at the log and if there is something visible. Thanks!

Not sure if this is an issue in necko or marionette. We actually complete the shutdown here, and it seems to me like the GetCurrentURL call comes after that, so there's no response.

Flags: needinfo?(valentin.gosu)

Thanks Valentin! I actually do not have an idea here given that this is not a persistent issue but only happens on Windows in our CI. Whether other platforms nor locally I can replicate this hang. Generally we use a 360s timeout for the socket, which explains the overall delay when waiting for a response. After that timeout the Python code raises the expected socket timeout issue. But the question is still what delays the final socket closure and delays that so much on in CI. It's permanent there and can be seen for each and every job.

I was actually able to reproduce it locally on my win10 machine and confirmed that the test was still hanging after the firefox process had ended.
I think it may be some odd windows quirk that a socket to a recently open port does not return connection refused - or it may be some actual necko bug. I think adding a shorter socket timeout is definitely a big win in terms of CI efficiency.
If you can summarize how GetCurrentURL works in marionette, I can look into whether there's some underlying necko issue here (feel free to file a separate bug for that)

See Also: → 1840701

Thanks Valentin! I filed bug 1840701 for further investigation. So this bug will just cover the reduced socket timeout change.

Pushed by hskupin@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/c9443abad96a [marionette] Make test "test_crash.py TestCrash.test_crash_content_process" faster on Windows. r=webdriver-reviewers,jdescottes
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 116 Branch

Given that we have to run those tests on 115esr for quite some time I would suggest that we get this test-only patch uplifted. Beta is not necessary given that the merge will happen early next week anyway. I'll request uplift for 115esr later this week.

As it looks like Valentin found the real underlying problem on bug 1840701. As such I will not request an uplift here but actually get the correct fix landed and uplifted via that other bug. I'm also considering to revert this patch if it works well without.

Depends on: 1840701

Sheriffs, can we please get this patch backed out? With the fix on bug 1840701 it's no longer necessary. Once done we should dupe this bug to bug 1840701. Thanks.

Flags: needinfo?(sheriffs)

Backed out changeset c9443abad96a (bug 1828902) as requested by whimboo for no longer being necessary.

Backout link: https://hg.mozilla.org/integration/autoland/rev/ef7d9ea0c3f751e7c525a136923c9c404c297b1f

Will be merged to central later on.

Duplicate of bug: 1840701
Flags: needinfo?(sheriffs)
Resolution: FIXED → DUPLICATE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: