On Windows the test "test_crash.py TestCrash.test_crash_content_process" runs about 360s
Categories
(Testing :: Marionette Client and Harness, defect, P2)
Tracking
(firefox-esr115 wontfix, firefox116 fixed)
People
(Reporter: whimboo, Assigned: whimboo)
References
Details
(Whiteboard: [webdriver:m7])
Attachments
(1 file)
I've noticed locally that the test test_crash.py TestCrash.test_crash_content_process runs for about 360s on Windows but only about 4s on Linux and macOS.
[task 2023-04-18T23:43:26.806Z] 23:43:26 INFO - TEST-START | testing/marionette/harness/marionette_harness/tests/unit/test_crash.py TestCrash.test_crash_content_process
[task 2023-04-18T23:43:26.806Z] 23:43:26 INFO - Suppressing GECKO output. To view, add `print(self.new_out.getvalue())` to the end of this test.
[task 2023-04-18T23:49:27.175Z] 23:49:27 INFO - Application command: Z:\task_168185948858909\build\application\firefox\firefox.exe -no-remote -marionette --wait-for-browser -profile C:\Users\task_168185948858909\AppData\Local\Temp\tmpzfa4vyk4.mozrunner
[task 2023-04-18T23:49:29.088Z] 23:49:29 INFO - TEST-PASS | testing/marionette/harness/marionette_harness/tests/unit/test_crash.py TestCrash.test_crash_content_process | took 362289ms
It can be fixed by setting the socket timeout to 10s for the get_url() call as well, so that there is no longer a hang in sock.recv(). We should investigate possible solutions to make the Mn test job faster.
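A minimal sketch of that idea, purely for illustration: it assumes the marionette_driver client object exposes a writable socket_timeout property, and the helper name reduced_socket_timeout is hypothetical rather than what the attached patch actually uses.

```python
from contextlib import contextmanager

@contextmanager
def reduced_socket_timeout(marionette, timeout=10):
    # Temporarily lower the client socket timeout so a command sent right
    # before the content process crashes fails after `timeout` seconds
    # instead of blocking in sock.recv() for the 360s default.
    # Assumes `marionette.socket_timeout` is readable and writable.
    original = marionette.socket_timeout
    marionette.socket_timeout = timeout
    try:
        yield
    finally:
        marionette.socket_timeout = original
```

In a crash test this could wrap the navigation check, e.g. with reduced_socket_timeout(self.marionette): self.marionette.get_url().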
Comment 1•2 years ago
The severity field is not set for this bug.
:whimboo, could you have a look please?
For more information, please visit BugBot documentation.
Assignee
Comment 2•1 year ago
I'm not able to replicate this long duration of the test locally. As such I enabled the logging output for crash tests and pushed to try:
https://treeherder.mozilla.org/jobs?repo=try&revision=515a50c903f539fd7fe4cb21b6dcbc17aaed25fa
Assignee
Comment 3•1 year ago
The following log lines show such a delay which actually happens during late shutdown of Firefox:
https://treeherder.mozilla.org/logviewer?job_id=420146472&repo=try&lineNumber=1534-1538
What's interesting is the delay at the very end before the browser shuts down and when we start the new Firefox process:
[task 2023-06-21T14:53:08.096Z] 14:53:08 INFO - DEBUG: Starting phase web-workers-shutdown
[task 2023-06-21T14:53:08.096Z] 14:53:08 INFO - DEBUG: Spinning the event loop
[task 2023-06-21T14:53:08.096Z] 14:53:08 INFO - DEBUG: Finished phase web-workers-shutdown
[task 2023-06-21T14:59:08.004Z] 14:59:08 INFO - Application command: Z:\task_168735841901231\build\application\firefox\firefox.exe -no-remote -marionette --wait-for-browser -profile C:\Users\task_168735841901231\AppData\Local\Temp\tmpx8f7ppfm.mozrunner
[task 2023-06-21T14:59:08.145Z] 14:59:08 INFO - DEBUG: Adding blocker AddonManager: shutting down. for phase profile-before-change
I wonder if this might be caused by the Marionette test, harness or client. So I'll add some more print calls to the Python code.
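As a hedged sketch of what such instrumentation could look like (the timed() helper and the "***" markers here are placeholders, not the actual patch; the later try logs show similar "*** after wait" lines):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Print a timestamp before and after a call so a hang shows up as a
    # large gap between the two lines in the harness log.
    print("*** before %s at %s" % (label, time.strftime("%H:%M:%S")))
    start = time.monotonic()
    try:
        yield
    finally:
        print("*** after %s at %s (+%.1fs)"
              % (label, time.strftime("%H:%M:%S"), time.monotonic() - start))
```

Wrapping the suspect call, for example putting the Wait(self.marionette).until(...) check inside with timed("wait"):, would make a multi-minute gap visible directly in the test output.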
Assignee
Comment 4•1 year ago
Ok, so the problem here is indeed a hang when sending the Marionette command while the server side is no longer responding:
https://treeherder.mozilla.org/logviewer?job_id=420277166&repo=try&lineNumber=1615-1714
We are sending WebDriver:GetCurrentURL at 12:43:27 within the call to Wait.until(), but it doesn't return until 12:49:27, when the log prints *** after wait. That is exactly 360s, which is the default socket timeout for Marionette.
It's not clear to me why we do not see this problem on other platforms or on my local Windows machine. Maybe the socket is in some cases not closed right away but remains in a state where it no longer sees incoming data and as such cannot send any reply?
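For illustration only, here is a small self-contained Python example (not Marionette code) of the client-side behaviour described above: a blocked recv() on a socket whose peer neither answers nor closes the connection only returns once the configured socket timeout expires.

```python
import socket
import threading

def silent_server(server_sock):
    # Accept the connection but never reply and never close it promptly,
    # mimicking a server that stopped servicing the socket.
    conn, _ = server_sock.accept()
    threading.Event().wait(10)
    conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
threading.Thread(target=silent_server, args=(server,), daemon=True).start()

client = socket.socket()
client.connect(server.getsockname())
client.settimeout(2)  # stand-in for Marionette's 360s default
client.sendall(b"ping")
try:
    client.recv(1024)  # blocks for the full timeout, then raises
except socket.timeout:
    print("no reply within the socket timeout")
```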
I'll create a new try build with nsSocketTransport:5 logging enabled in Firefox. Maybe this gives an indication of what's wrong.
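As an assumption about how such a try push is typically configured (MOZ_LOG and MOZ_LOG_FILE are standard Gecko logging environment variables; whether they are forwarded to the launched Firefox in exactly this way is not shown in this bug):

```python
import os

# Enable verbose socket transport logging in Gecko and write it to a file.
# "nsSocketTransport:5" follows the module:level syntax used by MOZ_LOG.
os.environ["MOZ_LOG"] = "nsSocketTransport:5"
os.environ["MOZ_LOG_FILE"] = os.path.join(os.getcwd(), "socket-transport.log")
```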
Assignee
Comment 5•1 year ago
My suspicion was probably right. The socket gets shut down but then we end up hanging:
https://treeherder.mozilla.org/logviewer?job_id=420328660&repo=try&lineNumber=4672-4673
Valentin, what should happen when a client has sent a message over the socket right before the socket is shutting down but no response has been sent out? Shouldn't the socket connection drop and the client get a disconnected error? Maybe that's not working correctly in this case? I would appreciate it if you could have a look at the log and check whether anything stands out. Thanks!
Assignee
Comment 6•1 year ago
For now I pushed a new try build which reduces the socket timeout to 10s when calling a Marionette command while we know that the content process can crash at any moment. Let's see if that works.
Assignee
Comment 7•1 year ago
Assignee
Comment 8•1 year ago
With the attached patch we can reduce the default socket timeout from 360s to just 10s, which means that we save around 6 minutes per test job. Per central merge and on release branches we run 8 jobs per push, which frees up roughly 47 minutes of machine time (8 × 350s ≈ 2800s).
Comment 9•1 year ago
(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #5)
> Valentin, what should happen when a client has sent a message over the socket right before the socket is shutting down but no response has been sent out? Shouldn't the socket connection drop and the client get a disconnected error? Maybe that's not working correctly in this case? I would appreciate it if you could have a look at the log and check whether anything stands out. Thanks!
Not sure if this is an issue in necko or marionette. We actually complete the shutdown here, and it seems to me like the GetCurrentURL call comes after that, so there's no response.
Assignee
Comment 10•1 year ago
Thanks Valentin! I actually do not have an idea here given that this is not a persistent issue but only happens on Windows in our CI. Neither on other platforms nor locally can I replicate this hang. Generally we use a 360s timeout for the socket, which explains the overall delay when waiting for a response. After that timeout the Python code raises the expected socket timeout error. But the question is still what delays the final socket closure so much in CI. It's permanent there and can be seen for each and every job.
Comment 11•1 year ago
I was actually able to reproduce it locally on my Windows 10 machine and confirmed that the test was still hanging after the Firefox process had ended.
I think it may be some odd Windows quirk that a socket to a recently opened port does not return connection refused, or it may be some actual necko bug. I think adding a shorter socket timeout is definitely a big win in terms of CI efficiency.
If you can summarize how GetCurrentURL works in Marionette, I can look into whether there's some underlying necko issue here (feel free to file a separate bug for that).
Assignee
Comment 12•1 year ago
Thanks Valentin! I filed bug 1840701 for further investigation. So this bug will just cover the reduced socket timeout change.
Comment 13•1 year ago
Comment 14•1 year ago
bugherder
Assignee
Comment 15•1 year ago
Given that we have to run those tests on 115esr for quite some time I would suggest that we get this test-only patch uplifted. Beta is not necessary given that the merge will happen early next week anyway. I'll request uplift for 115esr later this week.
Assignee
Comment 16•1 year ago
It looks like Valentin found the real underlying problem in bug 1840701. As such I will not request an uplift here but will instead get the correct fix landed and uplifted via that other bug. I'm also considering reverting this patch if it works well without it.
Assignee
Comment 17•1 year ago
Sheriffs, can we please get this patch backed out? With the fix on bug 1840701 it's no longer necessary. Once done we should dupe this bug to bug 1840701. Thanks.
Comment 18•1 year ago
Backed out changeset c9443abad96a (bug 1828902) as requested by whimboo for no longer being necessary.
Backout link: https://hg.mozilla.org/integration/autoland/rev/ef7d9ea0c3f751e7c525a136923c9c404c297b1f
Will be merged to central later on.
Comment 19•1 year ago
Backout merged to central: https://hg.mozilla.org/mozilla-central/rev/ef7d9ea0c3f7