Closed Bug 1864641 Opened 1 year ago Closed 5 days ago

MacOS specific part of HangMonitorChild::RecvSetMainThreadQoSPriority seems to hang frequently

Categories

(Core :: XPCOM, defect)

Unspecified
macOS
defect

Tracking

()

RESOLVED DUPLICATE of bug 1876306
Tracking Status
firefox-esr115 --- unaffected
firefox119 --- disabled
firefox120 --- disabled
firefox121 --- disabled
firefox122 --- disabled

People

(Reporter: jstutte, Unassigned)

References

(Blocks 2 open bugs, Regression)

Details

(Keywords: regression)

Attachments

(1 obsolete file)

In the recent ShutdownKill data, all macOS instances I clicked on were stuck inside HangMonitorChild::RecvSetMainThreadQoSPriority.

Component: DOM: Content Processes → XPCOM
Keywords: regression
Regressed by: 1834629

Set release status flags based on info from the regressing bug 1834629

:KrisWright, since you are the author of the regressor, bug 1834629, could you take a look? Also, could you set the severity field?


Looks like something's causing a hang related to the new codepath. I'm curious if it's related to some recent content process crashes that make the main thread unable to change contexts, resulting in a hang. I'll look into this. As it stands, this code is still in the experiment stage. Outside of the experiment, it has not been enabled beyond Nightly populations.

Assignee: nobody → kwright
Severity: -- → S3
Flags: needinfo?(kwright)


For posterity, this is gated on the threads.use_low_power.enabled pref.

pthread_override_qos_class_start_np can return NULL. In such a case, in the current code, if the dispatch fails, we'll call pthread_override_qos_class_end_np(NULL) which might be what we're hanging on. I'll build a patch.
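To illustrate the NULL case described above, here is a minimal sketch of a guard that only ends an override when starting it actually succeeded. This is not the Firefox patch. FakeStartOverride, FakeEndOverride, ScopedQoSOverride and RunWithOverride are hypothetical names, and the two Fake functions are stand-ins for pthread_override_qos_class_start_np and pthread_override_qos_class_end_np so that the sketch compiles on any platform.

```cpp
#include <cassert>

// Hypothetical stand-ins for the macOS calls (assumption: on macOS these
// would be pthread_override_qos_class_start_np and
// pthread_override_qos_class_end_np from <pthread/qos.h>).
struct FakeOverride { int unused; };
using OverrideHandle = FakeOverride*;

static int gEndCalls = 0;          // how many times "end" was called
static int gEndCallsWithNull = 0;  // how many of those received a null handle

// Returns a handle on success, or nullptr to simulate a failed start.
OverrideHandle FakeStartOverride(bool aSucceed) {
  static FakeOverride sOverride{0};
  return aSucceed ? &sOverride : nullptr;
}

int FakeEndOverride(OverrideHandle aHandle) {
  ++gEndCalls;
  if (!aHandle) {
    ++gEndCallsWithNull;
  }
  return 0;
}

// RAII guard: ends the override on scope exit, but only if the start call
// returned a non-null handle. A null handle is never passed to "end".
class ScopedQoSOverride {
 public:
  explicit ScopedQoSOverride(OverrideHandle aHandle) : mHandle(aHandle) {}
  ~ScopedQoSOverride() {
    if (mHandle) {
      FakeEndOverride(mHandle);
    }
  }
  ScopedQoSOverride(const ScopedQoSOverride&) = delete;
  ScopedQoSOverride& operator=(const ScopedQoSOverride&) = delete;

  // True if the override was actually started.
  explicit operator bool() const { return mHandle != nullptr; }

 private:
  OverrideHandle mHandle;
};

// Hypothetical caller: reports whether the override was active for the scope.
bool RunWithOverride(bool aStartSucceeds) {
  ScopedQoSOverride guard(FakeStartOverride(aStartSucceeds));
  return static_cast<bool>(guard);
}
```

With this shape, every exit path (including a failed dispatch) goes through the destructor, so the null check lives in one place. Whether passing NULL to the real end call is what hangs is still only a suspicion at this point in the bug.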

I'll take the Bug since I can reproduce it fairly consistently, using these Steps to Reproduce:

  1. Start building Firefox, causing CPU usage to reach 95% or higher via clang. It's likely that other methods of increasing CPU usage would also work, but I haven't been able to demonstrate that. In theory, multiple invocations of yes > /dev/null & will do this.
  2. While CPU usage is still high, navigate to "https://www.polygon.com/archives".
  3. Scroll up and down a few times, then click the "Next" button at the bottom. Switch to another window, then back to the browser window.
  4. Repeat Step 3 until the browser hangs.

This method causes a hang for me quite consistently, though it usually takes 5 minutes to make it happen. It's not easy to replicate, but it is consistent.

Assignee: kwright → bwerth

Here's one of my crash reports, generated when I force-quit the application after the hang had occurred. https://crash-stats.mozilla.org/report/index/6bb3fa75-f252-4a4d-bff1-dce380231129

Just reproduced this again, this time while watching a Twitch video during background compilation.

https://crash-stats.mozilla.org/report/index/dcfae4fa-5cbd-4b12-ba85-677420240112

I don't think I'm equipped to solve this. Taking myself off the Bug.

Assignee: bwerth → nobody
Attachment #9366929 - Attachment is obsolete: true
See Also: → 1872850
See Also: → 1876306

I had a similar hang today. I switched to a Phabricator tab that I had open, and the content process was unresponsive for a long time. I captured a profile, and it shows the process spending the whole time in __bsdthread_ctl, called from inside HangMonitorChild::RecvSetMainThreadQoSPriority: https://share.firefox.dev/4azCaTY

Nazim, do you remember if you dragged this Phabricator tab into a different window? I just encountered a frozen foreground tab after I had dragged a tab into a different window, and I wonder if that's just a code path where we're not sending the "force qos change" signal.

Hmm, good question. I don't remember doing it but I might have mistakenly dragged it a bit while trying to select it. But it should still be in the same window after the attempt as I mostly use a single window.

Blocks: 1895985

It was not confirmed, but we believe this was fixed by bug 1876306. (We kept this bug open while we monitored crash reports because we weren't sure the problem addressed in bug 1876306 caused the issues on this bug.) With bug 1876306 fixed, we no longer see instances of these ShutdownKill crashes in crash-stats.

Status: NEW → RESOLVED
Closed: 5 days ago
Duplicate of bug: 1876306
Resolution: --- → DUPLICATE