Closed Bug 1828902 Opened 2 years ago Closed 1 year ago

On Windows the test "test_crash.py TestCrash.test_crash_content_process" runs about 360s

Tracking

(firefox-esr115 wontfix, firefox116 fixed)

Status:

RESOLVED DUPLICATE of bug 1840701

Milestone:

116 Branch

Tracking Flags:

Tracking

Status

firefox-esr115

---

wontfix

firefox116

---

fixed

People

(Reporter: whimboo, Assigned: whimboo)

References

Details

(Whiteboard: [webdriver:m7])

Attachments

(1 file)

Bug 1828902 - [marionette] Make test "test_crash.py TestCrash.test_crash_content_process" faster on Windows. 1 year ago Henrik Skupin [:whimboo][⌚️UTC+2] 48 bytes, text/x-phabricator-request		Details \| Review

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Description

•

2 years ago

•

Edited

I've noticed locally that the test test_crash.py TestCrash.test_crash_content_process runs for about 360s on Windows but only 4s on Linux and MacOS.

https://treeherder.mozilla.org/logviewer?job_id=412932680&repo=mozilla-central&lineNumber=15361-15364

[task 2023-04-18T23:43:26.806Z] 23:43:26     INFO -  TEST-START | testing/marionette/harness/marionette_harness/tests/unit/test_crash.py TestCrash.test_crash_content_process
[task 2023-04-18T23:43:26.806Z] 23:43:26     INFO -  Suppressing GECKO output. To view, add `print(self.new_out.getvalue())` to the end of this test.
[task 2023-04-18T23:49:27.175Z] 23:49:27     INFO -  Application command: Z:\task_168185948858909\build\application\firefox\firefox.exe -no-remote -marionette --wait-for-browser -profile C:\Users\task_168185948858909\AppData\Local\Temp\tmpzfa4vyk4.mozrunner
[task 2023-04-18T23:49:29.088Z] 23:49:29     INFO -  TEST-PASS | testing/marionette/harness/marionette_harness/tests/unit/test_crash.py TestCrash.test_crash_content_process | took 362289ms

It can be fixed by setting the socket timeout to 10s as well for the get_url() call so that there is no hang in sock.recv() anymore. We should investigate possible solutions to make the Mn test job faster.

BugBot [:suhaib / :marco/ :calixte]

Comment 1

•

2 years ago

The severity field is not set for this bug.
:whimboo, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(hskupin)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

2 years ago

Points: --- → 2

Flags: needinfo?(hskupin)

Priority: -- → P2

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

2 years ago

Severity: -- → S3

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 2

•

1 year ago

•

Edited

I'm not able to replicate this long duration of the test locally. As such I enabled the logging output for crash tests and pushed to try:
https://treeherder.mozilla.org/jobs?repo=try&revision=515a50c903f539fd7fe4cb21b6dcbc17aaed25fa

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 3

•

1 year ago

The following log lines show such a delay which actually happens during late shutdown of Firefox:
https://treeherder.mozilla.org/logviewer?job_id=420146472&repo=try&lineNumber=1534-1538

What's interesting is the delay at the very end before the browser shuts down and when we start the new Firefox process:

[task 2023-06-21T14:53:08.096Z] 14:53:08     INFO -  DEBUG: Starting phase web-workers-shutdown
[task 2023-06-21T14:53:08.096Z] 14:53:08     INFO -  DEBUG: Spinning the event loop
[task 2023-06-21T14:53:08.096Z] 14:53:08     INFO -  DEBUG: Finished phase web-workers-shutdown
[task 2023-06-21T14:59:08.004Z] 14:59:08     INFO -  Application command: Z:\task_168735841901231\build\application\firefox\firefox.exe -no-remote -marionette --wait-for-browser -profile C:\Users\task_168735841901231\AppData\Local\Temp\tmpx8f7ppfm.mozrunner
[task 2023-06-21T14:59:08.145Z] 14:59:08     INFO -  DEBUG: Adding blocker AddonManager: shutting down. for phase profile-before-change

I wonder if this might be caused by the Marionette test, harness or client. So I'll add some more print calls to the Python code.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 4

•

1 year ago

Ok, so the problem here is indeed a hang when sending the Marionette command and the server side is not responding anymore:
https://treeherder.mozilla.org/logviewer?job_id=420277166&repo=try&lineNumber=1615-1714

We are sending WebDriver:GetCurrentURL at 12:43:27 with in the call to Wait.until() but that doesn't return until 12:49:27 when the log prints *** after wait. These are exactly 360s which is the default socket timeout for Marionette.

It's not clear to me why we do not see this problem on other platforms or on my local Windows machine. Maybe the Socket is in some cases not directly closed but remains in a state where it doesn't see incoming data anymore, and as such cannot send any reply?

I'll create a new try build with nsSocketTransport:5 logging in Firefox enabled. Maybe this gives an indication what's wrong.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 5

•

1 year ago

My suspicion was probably right. The socket gets shutdown but then we end-up hanging:
https://treeherder.mozilla.org/logviewer?job_id=420328660&repo=try&lineNumber=4672-4673

Valentin, what should happen when a client has sent a message over the socket right before the socket is shutting down but no response has been sent out? Shouldn't the socket connection drop and the client gets a disconnected error? Maybe that's not working correctly in this case? I would appreciate if you could have a look at the log and if there is something visible. Thanks!

Flags: needinfo?(valentin.gosu)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 6

•

1 year ago

For now I pushed a new try build which reduces the socket timeout to 10s when calling a Marionette command while we know that the content process can crash every moment. Lets see if that works.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 7

•

1 year ago

Attached file Bug 1828902 - [marionette] Make test "test_crash.py TestCrash.test_crash_content_process" faster on Windows. — Details

Phabricator Automation

Updated

•

1 year ago

Assignee: nobody → hskupin

Status: NEW → ASSIGNED

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 8

•

1 year ago

With the attached patch we can reduce the default socket timeout from 360s to just 10s which means that we save around 6 minutes per test job. Per central merge and on release branches we run 8 jobs per push, which frees up roughly 47 minutes of machine time.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

1 year ago

Whiteboard: [webdriver:m7]

Valentin Gosu [:valentin] (he/him)

Comment 9

•

1 year ago

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #5)

Valentin, what should happen when a client has sent a message over the socket right before the socket is shutting down but no response has been sent out? Shouldn't the socket connection drop and the client gets a disconnected error? Maybe that's not working correctly in this case? I would appreciate if you could have a look at the log and if there is something visible. Thanks!

Not sure if this is an issue in necko or marionette. We actually complete the shutdown here, and it seems to me like the GetCurrentURL call comes after that, so there's no response.

Flags: needinfo?(valentin.gosu)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 10

•

1 year ago

Thanks Valentin! I actually do not have an idea here given that this is not a persistent issue but only happens on Windows in our CI. Whether other platforms nor locally I can replicate this hang. Generally we use a 360s timeout for the socket, which explains the overall delay when waiting for a response. After that timeout the Python code raises the expected socket timeout issue. But the question is still what delays the final socket closure and delays that so much on in CI. It's permanent there and can be seen for each and every job.

Valentin Gosu [:valentin] (he/him)

Comment 11

•

1 year ago

I was actually able to reproduce it locally on my win10 machine and confirmed that the test was still hanging after the firefox process had ended.
I think it may be some odd windows quirk that a socket to a recently open port does not return connection refused - or it may be some actual necko bug. I think adding a shorter socket timeout is definitely a big win in terms of CI efficiency.
If you can summarize how GetCurrentURL works in marionette, I can look into whether there's some underlying necko issue here (feel free to file a separate bug for that)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

1 year ago

Comment 12

•

1 year ago

Thanks Valentin! I filed bug 1840701 for further investigation. So this bug will just cover the reduced socket timeout change.

Pulsebot

Comment 13

•

1 year ago

Pushed by hskupin@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/c9443abad96a [marionette] Make test "test_crash.py TestCrash.test_crash_content_process" faster on Windows. r=webdriver-reviewers,jdescottes

Sandor Molnar[:smolnar]

Comment 14

•

1 year ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/c9443abad96a

Status: ASSIGNED → RESOLVED

Closed: 1 year ago

status-firefox116: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 116 Branch

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 15

•

1 year ago

Given that we have to run those tests on 115esr for quite some time I would suggest that we get this test-only patch uplifted. Beta is not necessary given that the merge will happen early next week anyway. I'll request uplift for 115esr later this week.

status-firefox-esr115: --- → affected

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 16

•

1 year ago

As it looks like Valentin found the real underlying problem on bug 1840701. As such I will not request an uplift here but actually get the correct fix landed and uplifted via that other bug. I'm also considering to revert this patch if it works well without.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

1 year ago

Depends on: 1840701

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 17

•

1 year ago

Sheriffs, can we please get this patch backed out? With the fix on bug 1840701 it's no longer necessary. Once done we should dupe this bug to bug 1840701. Thanks.

Flags: needinfo?(sheriffs)

Cosmin Sabou [:CosminS]

Comment 18

•

1 year ago

Backed out changeset c9443abad96a (bug 1828902) as requested by whimboo for no longer being necessary.

Backout link: https://hg.mozilla.org/integration/autoland/rev/ef7d9ea0c3f751e7c525a136923c9c404c297b1f

Will be merged to central later on.

Duplicate of bug: 1840701

Flags: needinfo?(sheriffs)

Resolution: FIXED → DUPLICATE

Sandor Molnar[:smolnar]

Comment 19

•

1 year ago

Backout merged to central: https://hg.mozilla.org/mozilla-central/rev/ef7d9ea0c3f7

Ryan VanderMeulen [:RyanVM]

Updated

•

1 year ago

status-firefox-esr115: affected → wontfix

You need to log in before you can comment on or make changes to this bug.