Closed Bug 1884401 Opened 2 months ago Closed 1 month ago

Use "psutil" as process handler for mozrunner / mozprocess on MacOS to support the new application restart mechanism in Firefox (bug 1827651)

Categories

(Testing :: Marionette Client and Harness, defect, P2)

defect
Points:
5

Tracking

(firefox127 fixed)

RESOLVED FIXED
127 Branch
Tracking Status
firefox127 --- fixed

People

(Reporter: spohl, Assigned: whimboo)

References

(Blocks 4 open bugs)

Details

(Keywords: perf-alert, Whiteboard: [webdriver:m11])

Attachments

(2 files, 1 obsolete file)

We are reworking the way we launch and relaunch applications on macOS in bug 1827651. We are blocked from landing the code because we are running into Marionette test failures that we can't quite explain yet (https://treeherder.mozilla.org/jobs?repo=try&revision=e73afb0ca12508a43f65b1853c478116b44eee6c). In my conversations with Henrik on Matrix it became clear that this might require a change to Marionette, rather than changes to the patch in bug 1827651.

The easiest way that I found to reproduce the issue is by applying the patch in bug 1827651 (moz-phab patch D186701) and only running test_quit_restart.py (see attached diff). Marionette expects the returncode of the first process to still be None (which means "still running"). However, with the patch in bug 1827651 applied, the returncode is actually 0, the restarted process just sits around (which ultimately causes the timeouts on try) and Marionette even starts a new session because it didn't expect a returncode of 0.

Whiteboard: [webdriver:triage]
Points: --- → 3
Priority: -- → P2
Whiteboard: [webdriver:triage] → [webdriver:m10]
Assignee: nobody → hskupin
Status: NEW → ASSIGNED

After some investigation today it turns out to be actually a bigger issue. Here some details:

  • The current method for managing restarts in Firefox relies on the assumption that the exit code of the original process remains as None. This behavior is a result of the existing implementation and usage of mozrunner and mozprocess, where the reader thread of the process handler never terminates, leaving the returncode as None.

  • With this behavior in mind, we (unintentionally) internally update the detached process ID (pid) and manage that instead of the original one. This allows us to continue correctly piping stdout and stderr to our logger.

  • With the implementation of the new mechanism for shutting down Firefox (as addressed in bug 1827651), there's a significant change. Now, during a restart, the reader thread exits once the process has been closed, resulting in a returncode of 0. Consequently, mozrunner and mozprofile are no longer able to function with the new process. Internally, this means they cannot attach to the new process and assume that it has quit.

  • Additionally, the reader threads are no longer active, meaning any output to stdout and stderr is lost. While I've encountered this in the past on occasion, it's now a permanent situation with this new mechanism.

Considering the situation described above, a potential fix for mozrunner might involve moving away from mozprocess and utilizing psutil instead. However, this would be quite an invasive change.

Geoff or Joel, has there been any consideration or discussion about porting mozrunner to utilize psutil instead of mozprocess?

Stephen, would you happen to know what internal changes are involved with the new shutdown mechanism for the application? Could it possibly include the proactive closing of the stdout and stderr streams, a behavior that wasn't present before?

Flags: needinfo?(spohl.mozilla.bugs)
Flags: needinfo?(jmaher)
Flags: needinfo?(gbrown)

I haven't had any discussions (formal or casual) about changing any mozbase modules (except manifestparser) in the last 5 years. I recall :gbrown doing some changes to reduce the scope of some modules. I am game for reducing our reliance on mozprocess, I also suspect modern python and newer OS versions allow for less hacking.

As for bandwidth, there is no "A-team" that works on mozprocess anymore, we all have different day jobs!

Flags: needinfo?(jmaher)

Bug 1843347 is my now long-neglected attempt to reduce use of mozprocess. I haven't looked at mozrunner lately. I think it would be great if it no longer used mozprocess.

Flags: needinfo?(gbrown)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #1)

Stephen, would you happen to know what internal changes are involved with the new shutdown mechanism for the application? Could it possibly include the proactive closing of the stdout and stderr streams, a behavior that wasn't present before?

I can't say what the internal differences are between the NSTask API that was used to launch applications before and the new NSWorkspace route. Does Firefox somehow have awareness that it was launched by Marionette? We have replaced the legacy NSTask-way of starting applications with two new ways, one using a more recent NSTask API and one using NSWorkspace. One possible workaround could be to use the NSTask-route when launched by Marionette. While this doesn't test the NSWorkspace-route, it might be a temporary solution to land bug 1827651 for Firefox and enable macOS session resume while Marionette is being switched from mozprocess to psutil.

Flags: needinfo?(spohl.mozilla.bugs) → needinfo?(hskupin)

(In reply to Geoff Brown [:gbrown] from comment #3)

Bug 1843347 is my now long-neglected attempt to reduce use of mozprocess. I haven't looked at mozrunner lately. I think it would be great if it no longer used mozprocess.

Yeah, touching mozrunner in this regard might be a challenging task and not easy to get done in a short time frame.

(In reply to Stephen A Pohl [:spohl] from comment #4)
route. Does Firefox somehow have awareness that it was launched by Marionette? We have replaced the legacy NSTask-way of starting applications with two new ways, one using a more recent NSTask API and one using NSWorkspace. One possible workaround could be to use the NSTask-route when launched by Marionette. While this doesn't test the NSWorkspace-route, it might be a temporary solution to land bug 1827651 for Firefox and enable macOS session resume while Marionette is being switched from mozprocess to psutil.

It's not registered in the platform which tool launched the application, and this includes Marionette. However, Marionette is quite unique in that it has to support application restarts, a requirement that no other test suite or harness has. Besides the Python client we've discussed, Firefox has the Marionette and Remote Agent (WebDriver BiDi) components. Both of these could be queried via XPConnect to check if they are enabled and running. You can find an example of this in the code here.

Nonetheless, if we were to add a switch for now, I might lean towards using a preference. This approach would allow external tools to control the shutdown behavior in Firefox, making it easier for us to implement support for psutil (or another tool) as a replacement for the processhandler Python package.

What do you think?

Flags: needinfo?(hskupin) → needinfo?(spohl.mozilla.bugs)

I've done some more investigation yesterday:

  1. Using psutil for just Marionette to start the process and close it works fine. The benefit is that by using psutil.Popen() we can start our own binary so it's mostly behaving similar with processhandler except the stream reader code doesn't work - but maybe we won't need that or could do it in a different way. But what's most helpful is that with psutil.Process() we can attach to a different process that was not launched by us. All of the members of this class are also mapped into psutil.Popen() so we have a common set of helpers.

  2. Using psutil.Process() to attach to a process I was not able to get the stdout and stderr output handling working. Both streams only work for processes that we start with psutil.Popen() by using the appropriate arguments for the constructor, or by accessing the properties afterward. But note that these are not present on psutil.Process(). Maybe someone has an idea how to get this accomplished?

  3. psutil.Process instances have a lot of very helpful members which would allow us to give better error messages in case something doesn't work correctly with the managed process.

  4. With the new restart mechanism for Firefox on MacOS the new process is started in a way that no terminal is attached anymore. That results in no output of any log details in my terminal anymore. It also means that we can no longer capture this output which would be crucial for our testing.

As conclusion we could try to remove the custom Process implementation in mozprocess and replace it with using psutil.Popen which would fit most of our use cases for mozprocess. For Marionette it's more complicated given that we must support the restart feature and I cannot find a way to connect to the stdout and stderr streams.

Depends on: 1885431

I've spent some more time by end of last week for using psutil only for Marionette given that we cannot that easily replace the current mozprocess implementation with it. Especially for MacOS where we definitely need updated code for process handling it works pretty fine and (maybe unrelated) some wpt tests now pass permanently:

https://treeherder.mozilla.org/jobs?repo=try&revision=c2b72576572ef16b9e45b2f321ebe10961d83400

There are issues still for Linux and Windows which so far I cannot reproduce locally but which seem to fail permanently in CI.

As such we may consider using or own psutil process handler class for mozprocess just for MacOS for now and replace it for other platforms later.

Note that there is also still an issue to retrieve the exit code of a restarted process given that it's not under our own control (via POpen) no returncode property is available even not with psutil. That means that with a restarted process we don't know what caused the application to quit and if that was a crash (we would have to check for the minidump folder). Happy to hear feedback in case there is some other way to retrieve that information.

We should discuss if such a replacement for Marionette makes sense or if there are other methods that would allow to handle non-child processes that elegantly as psutil.

Whiteboard: [webdriver:m10] → [webdriver:m10][webdriver:triage]
Blocks: 1887666

As discussed with the team we will go ahead by using psutil for Marionette via a drop-in replacement in mozrunner. That means all other consumers are not affected! As first step we will also only do it for MacOS and follow-up with other platforms and features later.

Points: 3 → 5
Flags: needinfo?(spohl.mozilla.bugs)
Summary: Investigate and fix issues in Marionette related to the new application restart mechanism in Firefox (bug 1827651) → Use "psutil" as process handler for mozrunner / mozprocess on MacOS to support the new application restart mechanism in Firefox (bug 1827651)
Whiteboard: [webdriver:m10][webdriver:triage] → [webdriver:m10]
Blocks: 1887670
Blocks: 1887672
Blocks: 1885431
No longer depends on: 1885431
Attachment #9393551 - Attachment description: Bug 1884401 - [wpt] Update metadata for intermittently passing tests. → WIP: Bug 1884401 - [wpt] Update metadata for intermittently passing tests.
Attachment #9393552 - Attachment description: Bug 1884401 - [marionette] Use psutil on MacOS as process handler for mozrunner. → WIP: Bug 1884401 - [marionette] Use psutil on MacOS as process handler for mozrunner.
Attachment #9393551 - Attachment description: WIP: Bug 1884401 - [wpt] Update metadata for intermittently passing tests. → Bug 1884401 - [wpt] Update metadata for intermittently passing tests.
Attachment #9393552 - Attachment description: WIP: Bug 1884401 - [marionette] Use psutil on MacOS as process handler for mozrunner. → Bug 1884401 - [marionette] Use psutil on MacOS as process handler for mozrunner.

The severity field is not set for this bug.
:whimboo, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(hskupin)
Severity: -- → S3
Flags: needinfo?(hskupin)
Whiteboard: [webdriver:m10] → [webdriver:m11]
Attachment #9393552 - Attachment description: Bug 1884401 - [marionette] Use psutil on MacOS as process handler for mozrunner. → Bug 1884401 - [marionette] Use psutil on MacOS as custom process handler drop-in for mozrunner.
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ad37a0c06526
[marionette] Use psutil on MacOS as custom process handler drop-in for mozrunner. r=webdriver-reviewers,spohl,jdescottes
Attachment #9393551 - Attachment is obsolete: true

These are the Python Mochitest harness tests that are failing. I'm not sure why that is happening and I cannot reproduce locally. I've pushed a try build with Trace logs enabled. Maybe it will give a bit more details:

https://treeherder.mozilla.org/jobs?repo=try&revision=416f55f3f2d2924854169cf9d72a0bf6fa648978

Flags: needinfo?(hskupin)

The problem with the patch is actually that there is no conditional import of the new processhandler class for MacOS only. Given that on other platforms we do not require psutil the related import fails. Conditionally importing fixes the issue:

https://treeherder.mozilla.org/jobs?repo=try&revision=133ffa960cf1cdd58cebd4a2e612dcfc0b14e66c

I'll add the extra line to the patch and reland.

Attachment #9393552 - Attachment description: Bug 1884401 - [marionette] Use psutil on MacOS as custom process handler drop-in for mozrunner. → Bug 1884401 - [marionette] Use psutil on MacOS as process handler for mozrunner.
Attachment #9393552 - Attachment description: Bug 1884401 - [marionette] Use psutil on MacOS as process handler for mozrunner. → Bug 1884401 - [marionette] Use psutil on MacOS as custom process handler drop-in for mozrunner.
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e96fce323a1b
[marionette] Use psutil on MacOS as custom process handler drop-in for mozrunner. r=webdriver-reviewers,spohl,jdescottes
Regressions: 1892044
Status: ASSIGNED → RESOLVED
Closed: 1 month ago
Resolution: --- → FIXED
Target Milestone: --- → 127 Branch
No longer blocks: 1885431

(In reply to Sandor Molnar[:smolnar] from comment #13)

Backed out for causing py3 unit test failures

Backout link: https://hg.mozilla.org/integration/autoland/rev/b7c51567545cbceac3798fa536245baf600b18d4

Push with failures

Failure log -> testing/mochitest/tests/python/test_mochitest_integration.py::test_output_extra_args[plain-mochitest-args.ini] TEST-UNEXPECTED-FAIL

L.E

Also these perftest failures

Perfherder has detected a build_metrics performance change from push b7c51567545cbceac3798fa536245baf600b18d4.

Improvements:

Ratio Test Platform Options Absolute values (old vs new)
1% installer size osx-cross 94,985,376.58 -> 94,154,010.92

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests.

If you need the profiling jobs you can trigger them yourself from treeherder job view or ask a sheriff to do that for you.

You can run these tests on try with ./mach try perf --alert 42263

For more information on performance sheriffing please see our FAQ.

Keywords: perf-alert
Regressions: 1892940
Regressions: 1896106
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: