Closed Bug 1722722 Opened 4 years ago Closed 4 years ago

Very high CPU usage on OSX under QuartzCore Transaction::Commit

Categories

(Core :: Graphics: Layers, defect)

defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 - fixed
firefox90 --- wontfix
firefox91 --- wontfix
firefox92 --- wontfix
firefox93 --- fixed

People

(Reporter: mccr8, Assigned: mstange)

References

Details

(Keywords: regression, regressionwindow-wanted)

Attachments

(1 file)

When I visit this page on OSX, I get very high CPU usage in the Firefox main process (above 150%) and in the WindowServer process (around 140%): https://www.cvs.com/store-locator/sunnyvale-ca-pharmacies/1165-w-el-camino-real-sunnyvale-ca-94087/storeid=9856

Here's a profile: https://share.firefox.dev/3x4CAgB

It looks like the profile is spending most of its time inside some QuartzCore routines, but I could be mistaken.

I've noticed similar behavior starting a day or two ago on Twitch, with I think similar-looking profiles, but it was less surprising there so I hadn't filed a bug for it yet. I'm on a 7-26 Nightly. I'll see if restarting helps.

Summary: Very high CPU usage on OSX → Very high CPU usage on OSX under QuartzCore Transaction::Commit

The issue seems to have gone away after I updated to today's Nightly. I'll reopen if it shows up again.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME

Okay, I'm still seeing this issue. I'm not sure why it wasn't showing up immediately on restart. In addition to Twitch and the CVS page, I'm also seeing it on YouTube. The high CPU usage is making my fans spin up.

Status: RESOLVED → REOPENED
Keywords: regression
Resolution: WORKSFORME → ---

[Tracking Requested - why for this release]: seems like a regression, and it makes watching video on OSX (or even visiting CVS for some reason) rather annoying.

Severity: -- → S3
Component: Graphics → Graphics: Layers

How many windows do you have open? Can you collect another profile without Screenshots? Some of the CPU time on the Renderer thread is spent in screenshots code. Also, can you capture a profile of the same workload when it's still in the "good" state, for comparison purposes?

Flags: needinfo?(continuation)

The profiles look similar bug 1689215 comment 4. The main thread CPU overhead scales with the number of runnables. The profile from comment 0 shows a number of "layers::ChromeProcessController::NotifyLayerTransforms" runnables on the main thread; I wonder if that's a regression.

(In reply to Markus Stange [:mstange] from comment #4)

How many windows do you have open?

Can you collect another profile without Screenshots? Some of the CPU time on the Renderer thread is spent in screenshots code.

Sure. I guess I should use the "graphics" setting for the profile? That seems to not include the screenshots.

Also, can you capture a profile of the same workload when it's still in the "good" state, for comparison purposes?

Here's a profile on CVS after a fresh restart where the Firefox main process usage was around 30%: https://share.firefox.dev/3fbXpAn

I'll try to remember to get a sample of the WindowServer process when it is acting up again.

Flags: needinfo?(continuation)
QA Whiteboard: [qa-regression-triage]

Here's a capture while watching YouTube and the CPU was around 100%: https://share.firefox.dev/3l9oXu9

Attached file WindowServer sample

Here's a sample from ActivityMonitor of WindowServer when the CPU was around 140%. (The CPU usage for Firefox and WindowServer jumped up a bit after I took my last profile, FWIW.)

I had about 4 windows open when I took those.

I'm still not sure what causes the issue. I had a Firefox session running for quite a while yesterday and it was fine, and then the first YouTube video I played this morning hit the issue, so I thought maybe putting the machine to sleep might trigger it, but at least just doing that for a few minutes didn't bring it back after a browser restart. Nightly/WindowServer are still around 35%/25% CPU.

I looked at the main thread runnable counts for a one second window in both the good profile and the bad profile: they're 104 and 333. But the CPU usage increases by more than 3x between those profiles. So runnable count is not an adequate explanation.

Looks like we need to debug CoreAnimation and find out why it thinks the layer tree keeps changing. Maybe the fact that we're compositing on a non-main thread confuses its change tracking.

Independently from that, the bad profile has a lot more nsHttpActivityDistributor::ObserveActivityWithArgs runnables (152 during 1 second) and nsPipeInputStream::AsyncWait runnables (121 during 1 second).

Now I'm getting similar behavior just opening a bunch of Bugzilla tabs from Gmail. https://share.firefox.dev/3iDK6en
(edit: "from Gmail" not "from Bugzilla")

Assigning to myself for now. Debugging this will probably mean repeatedly stepping through assembly code in CoreAnimation...

Assignee: nobody → mstange.moz
Status: REOPENED → ASSIGNED

I've done a bit of assembly stepping, but in the "good" state, not in the bad state. I haven't really made a ton of progress. All I can see is that there's no obvious "return early" path. It seems perfectly expected that, on every turn through the native event loop, CA::Context::commit_transaction will be called, an Encoder will be created, and a message will be sent to the WindowServer... for all CAContexts, i.e. for all windows. This seems super wasteful to me, but maybe that's just the way things are.
What's not expected is that each call takes more than a millisecond. This probably only happens when we're in the "bad" state. But I have yet to find a way to get into that state.

Andrew, can you run Firefox with the following environment variables: CA_LOG_MEMORY_USAGE=1 CA_PRINT_OBJECTS=1 CA_PERF_WARNINGS=1 CA_PRINT_TRANSACTIONS=1
and then, when the problem manifests itself, open Console.app, start recording, and see if anything interesting related to CoreAnimation appears?
Ideally, we'd get in a state where the window doesn't do any painting, but CA still thinks it has to composite. This might require switching to a "quiet" tab, such as about:config.

Flags: needinfo?(continuation)

Will do. Thanks for investigating this. I haven't seen the issue recently, but I've also been studiously avoiding watching any video in Firefox. I'll have to start doing that again.

Flags: needinfo?(continuation)

I believe I am experiencing this same issue, or something very similar. Playing videos doesn't appear to trigger the issue for me, so let me know if I should file a separate bug instead. I haven't quite worked out the reproduction steps yet, so I'll just describe what I've found so far.

I have a large default profile, with ~500 tabs and ~30 windows. I am testing with Firefox Nightly 2021-08-23.

Just after startup

After initially starting Firefox and letting everything settle in, we reach a baseline of about 20% CPU use by Firefox and 20% CPU use by WindowServer. A profile from this time looks like https://share.firefox.dev/3Dkl1xp. 1% of parent process time is spent in CA::Transaction::commit. Overall, for driving so many windows, it seems pretty good. Ideally there would be no work to do at all, but that might be hard to achieve.

With the extra logging variables, Console shows logs like the following during this period:

default	21:13:31.678719+0100	firefox	CoreAnimation buffers: currently 1.51MB; maximum 1.78MB
default	21:13:31.920274+0100	firefox	Began transaction (0x12f981000) change on layer 0x7fa8f1a278b0 for contents, 0x0
default	21:13:31.920401+0100	firefox	Transaction (0x12f981000) property contents changed
default	21:13:31.938102+0100	firefox	Began transaction (0x12f981000) change on layer 0x7fa8f1a278b0 for contents, 0x0
default	21:13:31.938201+0100	firefox	Transaction (0x12f981000) property contents changed
default	21:13:31.954158+0100	firefox	Began transaction (0x12f981000) change on layer 0x7fa8f1a278b0 for contents, 0x0
default	21:13:31.954245+0100	firefox	Transaction (0x12f981000) property contents changed
default	21:13:31.971332+0100	firefox	Began transaction (0x12f981000) change on layer 0x7fa8f1a278b0 for contents, 0x0
default	21:13:31.971421+0100	firefox	Transaction (0x12f981000) property contents changed
default	21:13:31.988420+0100	firefox	Began transaction (0x12f981000) change on layer 0x7fa8f1a278b0 for contents, 0x0
default	21:13:31.988609+0100	firefox	Transaction (0x12f981000) property contents changed
default	21:13:32.004873+0100	firefox	Began transaction (0x12f981000) change on layer 0x7fa8f1a278b0 for contents, 0x0
default	21:13:32.004964+0100	firefox	Transaction (0x12f981000) property contents changed

There appear to be 60 CA transactions / sec, so perhaps that's tied to the monitor refresh rate...?

Later in the day

Then, "after a while" of use (one day later, same session, no restarts, no clear trigger), the steady state has risen to more like 50% CPU use by Firefox and 50% CPU use by WindowServer. A profile from this time looks like https://share.firefox.dev/2WmdMnw. CA::Transaction::commit has grown from 1% parent process time in the previous profile to 8% in this one.

The Console logs seem basically the same as before:

default	15:15:42.819247+0100	firefox	CoreAnimation buffers: currently 2.73MB; maximum 66.2MB
default	15:15:42.906346+0100	firefox	Began transaction (0x12f981000) change on layer 0x7fa8f1a278b0 for contents, 0x0
default	15:15:42.906419+0100	firefox	Transaction (0x12f981000) property contents changed
default	15:15:42.922127+0100	firefox	Began transaction (0x12f981000) change on layer 0x7fa8f1a278b0 for contents, 0x0
default	15:15:42.922187+0100	firefox	Transaction (0x12f981000) property contents changed
default	15:15:42.941709+0100	firefox	Began transaction (0x12f981000) change on layer 0x7fa8f1a278b0 for contents, 0x0
default	15:15:42.941776+0100	firefox	Transaction (0x12f981000) property contents changed
default	15:15:42.960844+0100	firefox	Began transaction (0x12f981000) change on layer 0x7fa8f1a278b0 for contents, 0x0
default	15:15:42.960903+0100	firefox	Transaction (0x12f981000) property contents changed
default	15:15:42.977971+0100	firefox	Began transaction (0x12f981000) change on layer 0x7fa8f1a278b0 for contents, 0x0
default	15:15:42.978038+0100	firefox	Transaction (0x12f981000) property contents changed
default	15:15:42.995711+0100	firefox	Began transaction (0x12f981000) change on layer 0x7fa8f1a278b0 for contents, 0x0
default	15:15:42.995779+0100	firefox	Transaction (0x12f981000) property contents changed

The internal CA buffer size is a bit larger than before, but also doesn't seem to be leaking or anything like that.

(In reply to Markus Stange [:mstange] from comment #15)

Andrew, can you run Firefox with the following environment variables: CA_LOG_MEMORY_USAGE=1 CA_PRINT_OBJECTS=1 CA_PERF_WARNINGS=1 CA_PRINT_TRANSACTIONS=1
and then, when the problem manifests itself, open Console.app, start recording, and see if anything interesting related to CoreAnimation appears?

What do you mean by this? I guess Console.app is an OSX program. By "recording" do you mean take a Firefox performance profile? Where would I see logs like in the prior comment? I don't see anything like that under any of the tabs in the Console program.

Flags: needinfo?(mstange.moz)

Sorry, I should have said "streaming" instead of "recording". The Console program has a button "Start streaming". Once you've clicked that, the logs will appear. (Or at least that's how it works on macOS 11.5.)

Flags: needinfo?(mstange.moz)

I haven't been ignoring this bug. I've actually had the same browser session open since August 25th, and I haven't hit the problem again before today, oddly enough. I used to get it about once a day after a restart.

Anyways, I seemed to hit the issue or something like it after I opened this page, which is a standard-looking ad-laden new website. I don't know why it caused issues when all of the other ones I haven't looked at. The CPU usage went up to 150% or so, with none of it showing up in the profiler. https://www.realclearpolitics.com/articles/2021/09/01/poll_newsom_opens_bigger_lead_in_calif_recall.html

I saved the logs, but the only thing about CoreAnimation was this: "CoreAnimation buffers: currently 0.274MB; maximum 50.9MB" which appeared about 20 times.

For myself, the issue now seems less noticeable since I upgraded from macOS 11.5.1 to 11.5.2. Apple's patch notes for 11.5.2 are even more useless than usual, as they say only that it includes (unspecified) bug fixes.

There are various forum threads where other users have noticed that 11.5.2 does appear to fix high CPU usage in WindowServer and also multi-second input lag when pressing Volume Up / Down and Caps Lock, so I suppose there's some chance those fixes have helped here as well.

:mccr8, what version of macOS are you on now? I wonder if you upgraded to 11.5.2 after this was initially reported...

Flags: needinfo?(continuation)

Ah, interesting. I'm on 11.5.2 and I tend to upgrade rather quickly when OSX prompts me. It looks like 11.5.2 came out around August 11, so that could be the explanation of why I'm not seeing the really bad behavior I was before.

Flags: needinfo?(continuation)

Is anyone still seeing this issue? Other than the supposed macOS 11.5.2 fix, bug 1690687 should also have improved things here.

Depends on: 1690687

Please reopen if this is still a problem.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago4 years ago
Resolution: --- → WORKSFORME

Nope, I haven't had any problems since it went away earlier. Thanks!

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: