Firefox spinning at 100% CPU, very janky. Have profile, but don't know how to read it.

RESOLVED WORKSFORME

Status

()

RESOLVED WORKSFORME
6 years ago
6 years ago

People

(Reporter: justin.lebar+bug, Unassigned)

Tracking

Trunk
x86_64
Mac OS X
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Reporter)

Description

6 years ago
My Nightly build was spinning at 100% CPU and was very janky.  It did this for about 5 minutes, then suddenly stopped.

All I had open was a few bugzilla tabs, gmail, google calendar, and an nytimes tab which I closed partway through.

I don't know how to read this profile, which shows very little time spent in Gecko.

BenWa, can you assist here?

http://people.mozilla.com/~bgirard/cleopatra/#report=60cd531561f0bfd0f77263a08d4773cb0de4061d
Most of the time is spent in Mutex like functions but it's not spent sitting idly in a lock nor does it look like a spin lock.

Looks like the contention is in memory barriers, memmove's and pthread_*. It *may* be swapping but I'm not sure of that either. A theory is something was spamming Mutex calls.

Any idea Rafael what would be causing these functions to be slow? Hit 'Invert Callstack' when opening the profile above.
Flags: needinfo?(respindola)
(Reporter)

Comment 2

6 years ago
It just started happening again (but went away too quickly for me to profile this time, sorry).  I was definitely not swapping.  But of course this time it wasn't necessarily the same cause as last time.
> Looks like the contention is in memory barriers, memmove's and pthread_*. It
> *may* be swapping but I'm not sure of that either. A theory is something was
> spamming Mutex calls.
> 

What do you mean my "memory barriers"? This is x86, right? The cpu has total store order, so memory barriers only exist at the compiler level.

If I read the profile correctly the leaf with the most time is 64 bit compare and swap. Is that what you see?

My guess is that the compare and swap itself is not slow, but that we are creating a large number of events and the OS is using the CAS to implement locks when walking the events.

I have never coded with cocoa, but most of the event related names are fairly generic, so it is hard to test this theory.  The only ones with somewhat descriptive names are those with "TrackingArea" in it. Total guess, but maybe this is graphics related?
Flags: needinfo?(respindola)
(Reporter)

Comment 4

6 years ago
I just filed bug 824658, which appears to be a different bug with the same symptoms.  Unless PNG decoding is hiding somewhere in this profile?
> The only ones with somewhat descriptive names are those with
> "TrackingArea" in it.

NSTrackingArea objects are associated with NSView objects, and are
used to track -[NSView mouseEntered:], -[NSView mouseExited:],
-[NSView mouseMoved:] and -[NSView cursorUpdate:] "messages".
(https://developer.apple.com/library/mac/#documentation/Cocoa/Reference/NSTrackingArea_class/Reference/Reference.html)

The profile from comment #0 says 5.7% of CPU time is spent in
+[NSEvent _discardEventsForTrackingArea:].  An NSView's NSTrackingArea
objects get removed when the NSView changes size or is deleted.  So I
suppose lots of NSViews are getting resized or deleted.

But 5.7% isn't a large number.  And neither is the 16.1% of CPU time
spent in OSAtomicCompareAndSwap64Barrier$VARIANT$mp.  These figures
don't explain the jankiness by themselves.  Instead I suspect the
jankiness comes from lots of "little" hangs.

I doubt that it'll be possible to unearth the causes of these hangs
from this profile (or any profile).  Instead I think we'll need a
bunch of gdb stack traces taken while the "janks" are happening.

The profile shows us there's contention between threads.  But we need
to know exactly what kind of contention, between which threads.

> and an nytimes tab which I closed partway through

This may explain the large number of calls to +[NSEvent
_discardEventsForTrackingArea:].
(Reporter)

Comment 6

6 years ago
I think focusing on the fact that the event loop is occasionally "hung" ("janks") is perhaps a red herring.  Note that we also spin at 100% CPU here; if we fix that, it seems likely to me that the hangs will disappear.

> Instead I think we'll need a
> bunch of gdb stack traces taken while the "janks" are happening.

To be clear, you want a bunch of all-threads stack traces taken using GDB?

> The profile shows us there's contention between threads.

I agree this is a possibility, but I don't follow how you conclude this with relative certainty.  CAS can be expensive even if there isn't any thread contention, right?
> To be clear, you want a bunch of all-threads stack traces taken
> using GDB?

Yes.

> I agree this is a possibility, but I don't follow how you conclude
> this with relative certainty.  CAS can be expensive even if there
> isn't any thread contention, right?

If the jankiness was entirely caused by one or a few "expensive"
operations, I'd expect the percentage of CPU time eaten by that (or
those) operations to be much higher (to be much closer to 100%).

Of course I'm assuming that CAS would show up as one or just a few
"items" in the list.
> Note that we also spin at 100% CPU here

What makes you say that?
By the way, Justin, I'd be curious just to see another profile.  This time make sure not to close a tab just before taking it.  I strongly suspect it'll look very different from your previous profile (that there won't be any "trackingarea" references in it).

I'd still like to see a bunch of all-threads stack traces taken using gdb, and I still suspect only they will truly be helpful.  But I do understand that won't be an easy task to accomplish.

Best of all would be if you could find reliable STR.
(Reporter)

Comment 10

6 years ago
>> Note that we also spin at 100% CPU here
>
> What makes you say that?

Comment 0 and the bug summary.  :)

I'll try to get another profile for you.  FWIW I've seen basically this same profile multiple times before.  Maybe it always happens after closing a tab; I dunno.
(Reporter)

Comment 11

6 years ago
> If the jankiness was entirely caused by one or a few "expensive"
> operations, I'd expect the percentage of CPU time eaten by that (or
> those) operations to be much higher (to be much closer to 100%).

But isn't it possible that we're just running this Cocoa code over and over again, and that the expensive part of that code is a CAS, so that's what shows up in the profile?

FWIW if we want an all-threads profile, I'll try to capture this in Instruments.
>>> Note that we also spin at 100% CPU here
>>
>> What makes you say that?
>
> Comment 0 and the bug summary.  :)

I meant what made you conclude that FF was eating 100% CPU. :-)

top?  Some other program?
> FWIW if we want an all-threads profile, I'll try to capture this in
> Instruments.

I frankly don't think *any* profile will be much use.

But for the purposes of comparison I'd like to see another Cleopatra
profile.  That is unless this bug only happens when you close a tab --
in that case don't bother.

What I'd *most* like to see, though, is reliable STR.
(Reporter)

Comment 14

6 years ago
> I meant what made you conclude that FF was eating 100% CPU. :-)

The activity monitor showed the Firefox process using 100% CPU.
(Reporter)

Comment 15

6 years ago
Created attachment 697612 [details]
Another profile

I captured another profile.  It seemed to start spinning while I was not using the browser, and it seemed to stop spinning soon after SPS finished doing its analysis.  Perhaps opening a new tab (as analyzing in SPS does) causes the loop to stop.

I'm having trouble getting SPS to upload the profile (and downloading the profile didn't generate a useful file), but I've attached a screenshot of Cleopatra.  Hopefully that's helpful.
I can't make anything of the information in the screen shot.  And in any case there's lots of information in these profiles -- much more than can be captured in any single screen shot.

So please do try to upload the profile you took, or another profile if you manage to take one later.

And as you keep seeing the problem, hopefully you'll be able to figure out how to reproduce it reliably.

By the way, I assume the current jankiness *wasn't* triggered by closing (or resizing) a tab or window.  Is that correct?
(Reporter)

Comment 17

6 years ago
> And in any case there's lots of information in these profiles -- much more than can be 
> captured in any single screen shot.

There appear to be multiple bugs in SPS here -- one preventing me from uploading the profile, and another preventing me from downloading the profile.  BenWa and I spent some time on this and didn't get very far.  Hopefully the next profile I take won't have this problem.

I think we need to think outside the "steps to reproduce" box here, particularly if this bug is happening while I'm not interacting with the browser.  I understand that STR would be helpful, and I'll do my best to get them for you, but I think it would be helpful if we'd entertain the question of how this might be happening, absent STR.
(In reply to Justin Lebar [:jlebar] from comment #17)
> > And in any case there's lots of information in these profiles -- much more than can be 
> > captured in any single screen shot.
> 
> There appear to be multiple bugs in SPS here -- one preventing me from
> uploading the profile, 


That's not a bug but a size limitation with profile over 10MB. You can decrease the size of the profile to a smaller size and it will upload successfully.


> and another preventing me from downloading the profile.
> BenWa and I spent some time on this and didn't get very far. 

Did you see my update on IRC? The link where you're hosting the profile doesn't have CORS headers so cleopatra doesn't have security permission to fetch the profile cross domain. Adding the .htaccess to your people folder will fix this problem. I modified the cleopatra to give a warning.
Justin:  If/when you manage to upload another jank profile, please also upload another profile (taken around the same time, in the same "context") without the jank.

These profiles are complex, and it's hard to tell what's "normal" from what isn't.  Having a "control" should help with this.
(Reporter)

Comment 20

6 years ago
> Did you see my update on IRC?

No, I didn't; sorry, and thanks for figuring this out!  Here's the working profile.

http://people.mozilla.com/~bgirard/cleopatra/?search=&customProfile=http://people.mozilla.org/~jlebar/profile-20130101-bug824583.json#
I think the profile would be more useful if you uncheck the non default 'jank' when profiling. The profile you posted doesn't record samples while the browser is responsive so it doesn't show a continuous time line.
By the way, Benoit, do you know of a program that can "diff" two profiles?  Might you consider building this capability into Cleopatra?
(Reporter)

Comment 23

6 years ago
> I think the profile would be more useful if you uncheck the non default 'jank' when profiling.

The "jank only" checkbox is not currently checked on my mac, and I'm pretty sure I didn't touch it between now and then.  Are you sure this is a jank-only profile?
(In reply to Steven Michaud from comment #22)
> By the way, Benoit, do you know of a program that can "diff" two profiles? 
> Might you consider building this capability into Cleopatra?

I'm working on it but it's still early. Feel free to give me feedback but I don't have many cycles at the moment to work on it. You'll see a 'Compare' button at the bottom left of the UI which you can load a local profile manually (or from your local storage if you loaded the profile recently). Later it should support URLs. Right now the basic idea is one view will match the selection of the other so its easy to compare samples and check their comparative size and sequence ordering from the green bars.
(Reporter)

Comment 25

6 years ago
I'm going to hope that this is just bug 824658 in disguise.
(Reporter)

Updated

6 years ago
Depends on: 824658
> The only ones with somewhat descriptive names are those with
> "TrackingArea" in it. Total guess, but maybe this is graphics
> related?

This was just a guess.  Comment #5 shows that it's wrong.

> I'm going to hope that this is just bug 824658 in disguise.

So this is extremely unlikely.
> So this is extremely unlikely.

Or maybe not.  There's not yet any relevant information here, so I suppose it might be related to anything.
(Reporter)

Comment 28

6 years ago
Let's just call this WFM until I see this in a build which has bug 824658 fixed.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.