Closed Bug 1553074 Opened 5 years ago Closed 2 years ago

WebRender is using too much CPU (Radeon HD 6900? Lots of BG tabs?)

Categories

(Core :: Graphics: WebRender, defect, P3)

69 Branch
Desktop
Windows 10
defect

RESOLVED WORKSFORME
Performance Impact medium

People

(Reporter: zxspectrum3579, Unassigned)

Details

(4 keywords)

Attachments

(6 files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0

Steps to reproduce:

Just normal browsing.

Actual results:

  1. FireFox starts to occupy all cores of the CPU until the total reaches 100%, and soon becomes barely responsive to any user interaction. It is not technically freezing, since it is still possible to exit the browser without killing it in Task Manager, even though it takes a few minutes for it to react to a click on the "X" close button.
  2. At times the parasitic 100% CPU usage starts from the very beginning, while the session is still loading, which I have not seen in many years, if ever. See the attached "parentBuildID" process, which is the main culprit -- killing it manually via Task Manager immediately helps the situation. It then gets recreated as a small-footprint process with almost no CPU usage.
  3. People suggested "refreshing profile", but from what I read it does not really help, ESPECIALLY since there is nothing wrong with the profile, as can easily be seen by running it on the FF 66 release version. The profile is fine and not corrupted in any way.
  4. "About:performance" tab is useless in the situation. It does not show any of the tabs occupying more than "Medium" CPU usage occasionally, which is normal. And it does not show any Addons/Extensions having significant CPU usage either.

Expected results:

Normal browsing operation.

Component: Untriaged → Performance
Product: Firefox → Core

Bug #1550531 could be related to this, if you're closing one of the Firefox windows (or, in some cases, tabs).

See Also: → 1550531
Summary: FireFox 68b2 becomes unusable due to parentBuildID process 100% core usage → FireFox 68b2, 68b3 becomes unusable due to GRE Omni.Ja parentBuildID process 100% core usage

This screenshot shows that all FF68b3 processes together occupy all of the CPU's resources during the launch and GRE Omni.ja parentBuildID process being the main culprit.

The startup goes very slowly; when it finishes, the CPU usage goes down to normal levels. But when you try to do something with the tabs or links or basically anything, the process goes berserk again, so the only way to really solve it is to kill it manually via Task Manager so that it is recreated anew and then behaves normally.

OS: Unspecified → Windows 10
Hardware: Unspecified → Desktop

Safe mode testing shows a warning for TabBrowser script being unresponsive.

This was not happening previously, even when I tested with more than 1500 tabs, so it looks like a regression.

But, of course, there are no addons/extensions loaded, so there is no Omni.ja process. The issue is that I am still unable to find anything specific that causes the issue.

*"there is no parentBuildID Omni.ja process"

Summary: FireFox 68b2, 68b3 becomes unusable due to GRE Omni.Ja parentBuildID process 100% core usage → FireFox 68b becomes unusable due to GRE Omni.Ja parentBuildID process 100% core usage
Whiteboard: [qf]

Could you try to capture a performance-profile while Firefox is being slow?

To do that, go to https://profiler.firefox.com/ and click the button to install the add-on.

Then ctrl+shift+1 to start profiling (or use the "start" button in the globe-shaped add-on performance-profiler icon in your toolbar)
Then wait a bit, and then ctrl+shift+2 to capture the profile (or the "capture profile" button in the add-on menu).

This should open a new tab which will eventually show a visualization of the performance profile (which may include URLs you're currently visiting, be aware). Then you can use the "Share" button at the top right of the URL there to upload it and get a public URL that you could provide here, which we can take a look at to see what's going on.

Thanks!

Flags: needinfo?(zxspectrum3579)

Alas, it is not feasible since the process makes the whole CPU, all of its cores 100% busy and the browser is not really responsive until you manually kill the aforementioned process in Task Manager. And when it gets automatically restarted after that it behaves properly so there is nothing interesting to record.

It would be great if the main FireFox process had a parameter to force the process to "profile" the GRE Omni.Ja parentBuildID process for a limited time. This would help Mozilla to save significant time in debugging obscure/complicated cases like this when it is not easy to do manually. Do you guys have something like this?

Flags: needinfo?(zxspectrum3579)

Hmm. If Firefox is completely unusable after the issue has happened, then yeah, I'm not sure it's currently possible to capture a profile (though ideally that's a use-case we should address to help diagnose these sorts of issues).

I'm not sure what to suggest at this point, but I'm going to leave this in the [qf] triage and hope someone on the performance team might have some ideas.

We have a way to start and stop the profiler from the browser error console. Here's how you use it:

  1. Go to about:config and set devtools.chrome.enabled to true.
  2. When the browser starts becoming slow, open the browser console: hamburger menu -> Web Developer -> Browser Console
  3. Paste this into the textbox and press enter: Services.profiler.StartProfiler(1000000, 1, ["js", "stackwalk", "threads"], ["GeckoMain", "Compositor"]);
  4. Wait for a bit so that the profiler can record what's going on.
  5. Paste this into the textbox and press enter: await Services.profiler.dumpProfileToFileAsync("C:\\firefox-profile.json");
  6. Wait for it to print "undefined" and for the file to show up in C:\.
  7. Paste this into the textbox and press enter: Services.profiler.StopProfiler();

Then attach the firefox-profile.json file to this bug, or email it to me if it's too large.

I hope this works; if the browser is too busy to even open the error console then we're in a bad place.

"if the browser is too busy to even open the error console then we're in a bad place"

That was the case just before the arrival of version 69 to Developer Edition beta stage -- it looks like the browser is not 100% busy any more, so I will try to record something.

When I tried to execute "Services.profiler.StartProfiler(1000000, 1, ["js", "stackwalk", "threads"], ["GeckoMain", "Compositor"]);" method I got the following:

"NS_NOINTERFACE: Component returned failure code: 0x80004002 (NS_NOINTERFACE) [nsISupports.QueryInterface]" @stack-trace-collector.js:75

Still, I tried all of the following steps and "firefox-profile.json" was not created.

Is a fix for stack-trace-collector.js expected?

mstange, see the question above ^

Flags: needinfo?(mstange)

I've filed bug 1565343 about the stack-trace-collector message. I see it too, but on my machine it doesn't prevent StartProfiler from working.

But if "firefox-profile.json" does not get created, I don't know why that is, and I'm afraid I've run out of ideas for how to get the information we need. Sorry.

Flags: needinfo?(mstange)
Component: Performance → Performance Tools (Profiler/Timeline)
Product: Core → DevTools

Moving this to the profiler team to see if they have any ideas.

Component: Performance Tools (Profiler/Timeline) → Performance
Product: DevTools → Core

Actually I take that back. Going to NI some people here instead.

Flags: needinfo?(florian)
Flags: needinfo?(felash)

Hey User Dderss,

Can you set the following environment variables:

MOZ_PROFILER_STARTUP=1
MOZ_PROFILER_SHUTDOWN=Some path

Where some path is some known path on your file system? With those set, I believe Firefox will run the profiler automatically on startup, and should attempt to dump a profile automatically on shutdown.

So if you start with these environment variables, and then see the problem, and then immediately exit the browser, presuming that the exit occurs properly, you should get a profile written to the path you set as MOZ_PROFILER_SHUTDOWN. If you can get that and attach it to this bug, that'd be a huge help in figuring out what's going on here.

Flags: needinfo?(zxspectrum3579)

Thanks, Michael, but, just in case: will it be anonymized (tab URLs, form data, so on)? If not, what environment variable(s) should I use to achieve that?

Flags: needinfo?(zxspectrum3579)

(In reply to User Dderss from comment #16)

Thanks, Michael, but, just in case: will it be anonymized (tab URLs, form data, so on)? If not, what environment variable(s) should I use to achieve that?

If you drag the file that's generated onto https://profiler.firefox.com, and then click "Publish", you can choose which things you'd like to publish along with the profile, and then send us that link rather than the raw file.

Does that help?

Flags: needinfo?(zxspectrum3579)

Thanks but it did not get to this stage just yet.

I created both variables (just in case, both for the system and for my user account during various attempts) but nothing was written upon exiting. There were no errors during exiting, including in Windows' event logs for either the application or the system, so the issue is that FireFox.exe was not able to write the profiler data -- even though I entered the path of my regular "downloads" folder, to which FF is proven to have full access.

Is there an internal FireFox error log file (for errors that do not go to Windows' event log and hence are not there) that I could look at to see why it all fails?

Or is that actually not necessary, since FF may be stumbling on some JS error while trying to do the profiling (something like the one I cited above: "NS_NOINTERFACE: Component returned failure code: 0x80004002 (NS_NOINTERFACE) [nsISupports.QueryInterface]" @stack-trace-collector.js:75)?

Flags: needinfo?(zxspectrum3579)

Moving the NI request to Greg while I'm on PTO.

Flags: needinfo?(felash) → needinfo?(gtatum)

I think the MOZ_PROFILER_SHUTDOWN variable takes a file name, not a path. The file will be written to the current directory (when launching Firefox from a command prompt, the current directory can be set with the 'cd' command before launching Firefox). The current directory needs to be writable by the user starting Firefox.
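
As a rough illustration of the setup described above, the following Python sketch launches Firefox with both variables set, passing MOZ_PROFILER_SHUTDOWN a bare file name and choosing the working directory explicitly. Only the two environment variable names come from this thread; the helper names, the executable path, and the output directory are assumptions made up for the example.

```python
import os
import subprocess
from pathlib import Path


def profiler_env(profile_file_name, base_env=None):
    """Return a copy of the environment with the startup profiler enabled."""
    # Per the comment above, the shutdown variable takes a file name, not a path.
    if Path(profile_file_name).name != profile_file_name:
        raise ValueError("expected a bare file name, not a path")
    env = dict(os.environ if base_env is None else base_env)
    env["MOZ_PROFILER_STARTUP"] = "1"
    env["MOZ_PROFILER_SHUTDOWN"] = profile_file_name
    return env


def launch_firefox(firefox_exe, work_dir, profile_file_name):
    """Start Firefox from work_dir so the profile is written there on exit."""
    work_dir = Path(work_dir)
    # The current directory must be writable by the user starting Firefox.
    if not os.access(work_dir, os.W_OK):
        raise PermissionError(f"{work_dir} is not writable")
    return subprocess.Popen(
        [str(firefox_exe)],
        cwd=str(work_dir),
        env=profiler_env(profile_file_name),
    )


# Example call (hypothetical paths, adjust to the local installation):
# launch_firefox(r"C:\path\to\firefox.exe", r"C:\path\to\writable\dir",
#                "firefox-profile.json")
```

Setting the working directory in the launcher plays the same role as running 'cd' in a command prompt before starting Firefox.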

Flags: needinfo?(florian)

I can confirm on my side that the stack-trace-collector.js error does not get in the way of collecting a profile from the steps outlined in comment 8.

Flags: needinfo?(gtatum)

(In reply to Florian Quèze [:florian] from comment #20)

I think the MOZ_PROFILER_SHUTDOWN variable takes a file name, not a path. The file will be written to the current directory (when launching Firefox from a command prompt, the current directory can be set with the 'cd' command before launching Firefox). The current directory needs to be writable by the user starting Firefox.

You are right, thanks; I added a filename and it has worked.

Here is a link to the profile I recorded (with hidden threads, hidden time range and extension information): https://perfht.ml/2JGrOHW

While it is perfectly understandable that extensions might easily slow down the startup of the browser, it would be interesting to know:

  1. What happened between FF 67 and 68 (version 69 has improved it a bit, thanks to process priorities finally being set right for Windows) that made things go so dramatically bad with the way some extensions work, blocking the core FireFox.exe process/GUI to the point where it becomes so unresponsive?
  2. What should be done to the "about:performance" page so that it becomes useful in such situations? It should be able to show the load of all of the FF processes, including the GRE Omni pile, and exactly which extensions abuse background operations. If I understand it correctly, "about:performance" is currently only able to show overt extension load, which is why it was useless for me when FF was taking a full 100% of all cores and yet the page showed that everything was fine, with none of the extensions popping up in red with high load.

Thank you in advance.

One thing jumps out immediately from the profile in comment 22 - the parent process is completely blocked waiting for messages to return from the compositor.

I don't believe the compositor process is being sampled here.

Dderss, can you please supply the text from about:support (by visiting about:support and choosing "Copy text to clipboard")?

Flags: needinfo?(zxspectrum3579)
Attached file about:support data

Thanks, Dderss. Does the behavior improve if you disable WebRender? You can disable WebRender by setting the gfx.webrender.force-disabled pref to true in about:config and restarting.

Indeed, the profile shows the last 24 seconds of a very lengthy shutdown process. The shutdown is slowed down by two sync IPC calls to the compositor: "PAPZInputBridge::Msg_ProcessUnhandledEvent" and "PCompositorBridge::Msg_WillClose". We don't know the duration of the first sync call, but the duration of the WillClose call is exactly 10 seconds. I wonder whether that's some magic sync IPC timeout value? Or does the GPU process actually get around to responding after that time?

(In reply to Mike Conley (:mconley) (:⚙️) from comment #25)

Thanks, Dderss. Does the behavior improve if you disable WebRender? You can disable WebRender by setting the gfx.webrender.force-disabled pref to true in about:config and restarting.

Thanks for the prompt, it has solved the issue. The GRE Omni process feels fine so far -- no matter what, its CPU load does not surpass 2%.

Why do you think WebRender is misbehaving?

Flags: needinfo?(zxspectrum3579)

I'm not sure, but our Graphics folks might be able to give you some diagnosis steps. Moving to the right component.

Component: Performance → Graphics: WebRender

The good thing is that the issue never seems to come up after I kill the GRE Omni process and it gets automatically restarted, so it seems like the problem is somehow tied to startup operations.

bug 1566206 was fixed in Beta 69.0b7 and Stable 68.0.1: Can you still reproduce this problem after updating?

Jan, my FF 69.0b7 says it is perfectly updated and there is nothing newer. (I am working in the English version so there should not be any language pack/compilations delays or anything.)

Where can I download version 69.0b7? Thanks in advance.

*my version is FF 69.0b5

Oh, I copied a typo. The fix is included in 69.0b6. You should get an update in the next hours, otherwise it's already available here: https://download.mozilla.org/?product=firefox-beta-latest-ssl&os=win64&lang=en-US

Blocks: wr-69

User Dderss, can you try turning WebRender back on to see if the problem returns?

Flags: needinfo?(zxspectrum3579)

Thanks, Jeffrey.

I updated FF to 69.0b6 and re-enabled WebRender.

The good thing is that the process does not occupy CPU resources if I am not interacting with the browser -- that is already progress.

But sadly it makes the GRE Omni process busy if I do anything with the browser at all. Just typing this text makes it occupy up to 12% of CPU resources, and if I am loading a new page it rockets to 30% and more until 100% of the CPU is busy. It quickly goes down, but I still doubt that it should be this way.

Flags: needinfo?(zxspectrum3579)

Jeffree, did you hear about this issue?

The screenshot shows two states of the browser: the smaller window is the starting point, with the renderer working fine, and the big background window with a blank viewport is what appears when the user resizes the window (which is reversible).

Flags: needinfo?(jmuizelaar)

(In reply to User Dderss from comment #37)

Created attachment 9079423 [details]
FireFox 69 rendering disappears when the width of the windows is increased from a small size and reappears once the window gets smaller again.png

Jeffree, did you hear about this issue?

No. I haven't seen anything like this. Can you file a new issue for that?

Flags: needinfo?(jmuizelaar)
See Also: → 1567583
Priority: -- → P2
Whiteboard: [qf] → [qf:p2:resource]

Heyo following up on this.

We got a graphics card from the same family as yours (A Radeon HD 6950, which is a slightly worse version of your HD 6970). Sadly, I don't see any issues when I boot up a fresh profile of 69.0b5 with webrender enabled. The steps I used to try to reproduce the issues you described are:

  • load up a bunch of tabs (youtube videos, articles, etc)
  • resize the window a bunch of times
  • type text into random fields

Everything remained snappy and well-rendered under these conditions.

Just confirming: did we actually conclude that your (~30) extensions aren't causing any problems (for webrender)? Notably if you do launch the browser in a "refresh"/"safe-mode" state, it tends to disable webrender (and it's hard to notice).

If not, here's some simple tests to get more information on your extensions:

  1. go to about:profiles, create a new profile (top left), and select "launch profile in new browser" (all following steps are in new browser)
  2. go to about:config and enable "gfx.webrender.all"
  3. go to about:profiles and select "restart normally..." (top right)
  4. go to about:support and confirm that it says "Compositing WebRender"
  5. do your normal reproduction steps. If you still see the issue, we are certain it's not your extensions or anything else. Good!
  6. if you don't see the problem anymore, then there's a good chance it's one/multiple of your extensions misbehaving! Good!
  7. if the latter is the case, and you're willing to do a bunch of boring work to help isolate it, we can try to "binary search" your extensions. Here's the basic idea:
    1. you can do this in your fresh profile (requires reinstalling all of the extensions) or your normal one (might be annoying to get back to where you started) -- hopefully you should only need to install/test the set of extensions you currently have enabled. (either way, it might be necessary to keep track of the enabled extensions at each step to avoid getting lost)
    2. turn off half of your currently enabled extensions. Do a "restart normally..." to clean out their state.
    3. if you still see the problem, then this half of the extensions (the half still enabled) contains the problem(s)! Repeat from sub-step 2 (turn off another half)
    4. if you suddenly don't see the problem, then the problem is in the half you just disabled. Disable everything, re-enable that half, and repeat from sub-step 2
    5. ideally, this repeats until you're down to 1 extension, which we now know is the issue
    6. if somewhere in the middle it randomly stops happening, then it's a combination of extensions that are currently enabled and the ones you just disabled (ouch!). You can try to get lucky and pick new random "halves" of those extensions, or you can give up here and just tell us the set you reduced it down to.
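
The "binary search" over extensions described above can be sketched in a few lines of Python. This is only an illustration of the halving idea, assuming a single misbehaving extension; `problem_reproduces` is a hypothetical stand-in for the manual step of enabling a set of extensions, restarting, and checking by hand, and the sketch tests each half in turn rather than only the half left enabled.

```python
def find_culprit(extensions, problem_reproduces):
    """Narrow a list of extensions down to one suspect by repeated halving.

    problem_reproduces(enabled) must return True when the issue shows up
    with exactly the extensions in `enabled` switched on.
    Assumes one extension is responsible; a combination of extensions
    (the "ouch" case above) is reported as None.
    """
    suspects = list(extensions)
    if not suspects or not problem_reproduces(suspects):
        return None  # the problem is not caused by this set at all
    while len(suspects) > 1:
        half = len(suspects) // 2
        first, second = suspects[:half], suspects[half:]
        if problem_reproduces(first):
            suspects = first
        elif problem_reproduces(second):
            suspects = second
        else:
            # Neither half reproduces on its own: likely an interaction.
            return None
    return suspects[0]
```

With about 30 extensions, this takes roughly five rounds of restarts instead of one restart per extension.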

Also: have you at all noticed any particular sites or conditions that this happens more under?

  • Does it happen with only one tab? Only with lots of tabs?
  • Does it take a while to happen, or does it happen right away?
  • Do any websites have a tendency to trigger the issue faster or more often? (youtube? twitter? amazon?)
  • Are you using multiple monitors? Does it happen without them?
  • Are your monitors notable in any way? (freesync? really high resolution?) (sadly these details seem to be absent from about:support)
Flags: needinfo?(zxspectrum3579)

Thanks, Alexis.

You were not able to reproduce the issue because you did not have many tabs. In previous FF versions, I was able to run up to 1500 tabs with no issues (of course, they were only in click-to-load mode), but recently there has been a regression, which I detected while running the browser in Safe Mode, first in FF 69b3 and, just to be sure, now again in FF 69b10 with the same result: an unresponsive TabBrowser-tabs.js script warning. I have not seen that in many years, if ever.

In Safe Mode, any graphics-related activity, e.g. just playing a video, occupies 100% of a full core, which obviously should not happen -- especially since, as you correctly mentioned, WebRender is switched off in Safe Mode; the "Compositing" feature shows "Basic".

So before we even deal with WebRender, can you please clarify whether, for example, video playback is expected to be this slow in Safe Mode/Compositing=Basic?

If it should not really be this slow, then maybe the whole thing is actually an AMD driver issue for the HD 6900 family of GPUs? The software status says that everything is fine; the driver version is 15.301.1901, which is the latest for this family of GPUs. The driver will never be updated/fixed as it is old, but with previous versions of FF there were no such issues with the exact same driver, which has not been updated for many years. The whole W10 system was fresh before, and it is fresher now with my upgrade to W10/1903.

Flags: needinfo?(zxspectrum3579)

(coming back from a long weekend, sorry)

Yes, in Safe Mode you disable all hardware acceleration, which is to say we won't use your GPU at all. Video performance is heavily reliant on using the GPU regardless of if webrender is enabled or not. This is why I suggested creating a fresh profile and enabling webrender in it (requires a restart of that profile). Safe Mode won't give us any useful information, it's just there for emergencies.

Thanks, Alexis.

I did the experiment and found out that FF 69.0b10 is slow even with a new profile with no extensions other than Gecko Profiler:
https://perfht.ml/2T9GgLp

I recorded simple scrolling of this very page and saw up to 5-7% CPU usage because of it, which should not happen. This is a dead HTML page, so there are zero reasons why it should rationally occupy this much of the resources unless we are dealing with a weird issue between WebRender and AMD's Radeon HD 6000/6900-family drivers, right?

Flags: needinfo?(a.beingessner)

Scrolling this page using up 5-7% of cpu is a pretty far cry from the issue I thought we were trying to solve? Earlier, you were describing interacting with the browser causing 100% load, making it unusable, as well as window resizing becoming corrupted. Am I misunderstanding? Is that not a problem at all anymore?

(This isn't a particularly static page, there are plenty of clickable things.)

If this is just about slightly excessive cpu usage now, I would need some measurements of a fresh profile without webrender to understand how much worse we're doing. Also it would help to know what CPU you have (want to know how beefy it is).

Flags: needinfo?(a.beingessner)

It is the same issue. It is not a far cry, since in my test I only had three tabs open, unlike my working profile with a huge number of tabs. Also, this current page does not have live animations/transitions, so in terms of rendering it is all but dead HTML. My CPU is an Intel i7-3770 @ 3.6 GHz. A process taking up to 7% of this CPU means that it occupies roughly half of one logical core just because of scrolling. If you scale this up to my actual profile, it becomes a disaster: even when the browser is not freezing, it occupies up to 80% of the whole CPU with any activity of mine in the browser. It does completely calm down if I am not typing anything in this window and not scrolling through this page.
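
The per-core arithmetic in the comment above can be checked with a short Python sketch. It assumes the percentage shown is a share of total CPU capacity (as Task Manager reports it) and that the i7-3770 exposes 8 logical cores (4 cores with Hyper-Threading); the function name is made up for illustration.

```python
def share_of_one_logical_core(total_cpu_percent, logical_cores):
    """Convert a share of total CPU capacity into a share of one logical core."""
    return total_cpu_percent * logical_cores


# 7% of the whole CPU on 8 logical cores is about half of one logical core.
print(share_of_one_logical_core(7, 8))  # 56
# 12.5% of the whole CPU corresponds to one fully busy logical core.
print(share_of_one_logical_core(12.5, 8))  # 100.0
```

By the same conversion, a reading of 12.5% of total CPU on this machine would already mean a single thread saturating one logical core.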

Switching off WebRender was already tested, see above; it did solve the issue. The CPU load dropped to 1-2% maximum.

So what could be causing this abnormally high CPU usage?

Thanks in advance.

Ok I had confused myself a bunch because this thread is massive and things kept changing. Here is my summary of the current situation:

The only problem being reported in this issue now is that webrender is too CPU intensive, especially with a heavy tree-style-tabs workflow. The user has ~1500 tabs, but most of those are "hibernated" on startup and never touched. However they likely accumulate non-hibernated tabs much more quickly than an average user, because that's the tree-style-tabs workflow!

It is therefore possible that when webrender is enabled, we are doing more CPU intensive work that is somehow exacerbated by non-hibernated background tabs. Anecdotally I do think I saw this a few times while trying to see way more disastrous perf. With an 8 core (16 logical core) CPU, I remember seeing roughly the following in the OS CPU metrics (CTRL+SHIFT+ESC):

  • 100% cpu while resizing the window
  • 50% cpu while changing/loading tabs
  • 20% cpu while scrolling bugzilla
  • 10% cpu idle

These numbers aren't shocking, but perhaps they can be improved. Notably the user reports far better CPU usage with webrender disabled (didn't verify this myself).


Things That Previously Happened In This Thread That No Longer Matter:

In 68/69 beta the user was experiencing some sort of busy-loop deadlock. This went away ~3 weeks ago, which strongly suggests they were running into Bug 1566206, which was fixed and uplifted ~3 weeks ago.

Status: UNCONFIRMED → NEW
Ever confirmed: true
Priority: P2 → P3
Summary: FireFox 68b becomes unusable due to GRE Omni.Ja parentBuildID process 100% core usage → WebRender is using too much CPU (Radeon HD 6900? Lots of BG tabs?)
Version: 68 Branch → 69 Branch

Alexis, thanks, that is a great summary. Indeed, thanks to version 69's better process priorities on Windows and possibly other fixes, I no longer experience nearly complete freezes of the browser with WebRender enabled; my only current issue is high CPU load when I do anything at all with the browser.

Could you also check what is happening during start-up? I currently have 1300+ tabs (almost all of them dormant, though) in the profile, but even when I had e.g. 1500+ tabs in versions earlier than FF 67 or so, I never had an issue starting in Safe Mode. In all of the latest versions, I am getting a guaranteed "TabBrowser.xml:2098" unresponsive script warning. Considering that those tabs are click-to-load, why has the browser recently started wasting so much time on them? Also, can we switch that processing off completely for tabs that are not actually loaded? They never need the whole malloc etc.; they only need to restore the URL and tab icon, without loading form data and everything else until it is actually necessary. This way the browser will start much faster and will occupy much less memory.

No longer blocks: wr-69

Hi!
This problem continues with version 70.0b5.
The problem arises every time I inspect a web page. A few minutes after I start scrolling around in the developer tools, the -greomni omni.ja processes take all my CPU and all Firefox tabs freeze up.
It doesn't matter whether the gfx.webrender.force-disabled parameter is true or false.
(Ubuntu 16/18, graphics card: Mobility Radeon HD 4330/4350/4550)

Performance Impact: --- → P2
Whiteboard: [qf:p2:resource]
Severity: normal → S3

Does this still reproduce for you when using a current version?

Flags: needinfo?(zxspectrum3579)
Flags: needinfo?(mkancija)

Can't say, outdated hardware.
I no longer have this computer or hardware to reproduce this problem.
No problem on nvidia geforce gtx.

Flags: needinfo?(mkancija)

I do not have this video card any more, but right before I let it go I was no longer experiencing those crazy CPU loads on every action, so the issue might have been solved. Only current HD 6900 card owners can say for sure, but this is my guess.

Flags: needinfo?(zxspectrum3579)

Thanks for the update

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME