Open Bug 1347666 Opened 4 years ago Updated 2 days ago

Huff Post page load CPU usage much higher in Nightly vs 45 ESR


(Core :: General, defect)

Not set




(Reporter: Harald, Unassigned)


(Blocks 2 open bugs)


(Keywords: perf, Whiteboard: [qf:p3])


(1 file)

(Accessible behind VPN on http or auth0 on https):

Combined dashboard [1] shows that huff post article page loads about 35% slower in Nightly vs ESR 45.

Firefox 45:

Patrick McManaus looked at the results and pointed out that CPU usage is much higher in Nightly.

[1]: (pick uff post in the 'Aggregated boxplots over time' section to get clickable scatterplot)
its this page

and that's a 55 nightly.. not sure if we can get the build cset or not
Blocks: QuantumFlow
Whiteboard: [qf] → [qf:investigate]
Someone needs to capture a profile.
Profile from today's nightly on Win64:
(I'm new to this; please let me know if I captured the wrong thing.)
See Also: → 1337058
There is definitely something weird going on here but I don't think that's your fault!  There are a fair number of sync IPC messages going around here <> and I just refuse to believe that they all happen to finish in exactly half a millisecond, even though it seems like they all took one sample.  Was your sampling frequency 0.5ms?

Markus, this looks like a super fishy profile.  Can you please have a look?
Flags: needinfo?(mstange)
I think the profile looks fine, with the exception of the profile interval: It looks like the chosen interval was 0.5ms but the actual interval we were able to achieve was 2ms. I've seen this before on Windows profiles. Either we are slower to capture stacks on Windows, or we're running into timer resolution problems, or something else is happening.

The tracing markers at the top give you the exact times for the sync IPC messages. Most of them are below 0.5ms. For example I can see one GetCookieString message that took 0.11ms.
That's not hard to believe because the parent process looks completely idle.
Flags: needinfo?(mstange)
You're right -- I played around with the addon when I first installed it, and later forgot that I had set my sampling frequency to 0.5ms. Is there a recommended value that I should use for reporting profiles in bugs?
Usually 1ms, but since we apparently don't manage to go lower than 2ms on Windows, I recommend 2ms on Windows.
If the intended interval is close to the actual interval, the numbers on perf.html are more trustworthy and the graphs look better.
(In the thread stack graphs at the top, we draw each sample's bar with the width of the intended interval. In the profile you posted, if you zoom in a few times (by selecting a range and clicking the magnifying glass button), you'll see large gaps between the individual sample bars. If the actual interval and the intended interval match, these bars will be mostly adjacent.)
I filed bug 1350593 about the profiler resolution issue on Windows.  I was wondering why those lines look like that...

David, can you please re-profile?  Thanks!
Oh sorry, I thought I had already posted a new one.
Here's one at 2ms:
The ads took forever to load, maybe bad hotel wifi?
And that one seems to have a problem with symbols. Third time's the charm?
The reflow times may be worth looking into (no idea whether they are unusual or not.)  The paint times also seem really high, but it seems like they're all in some kind of JIT code?  And a lot in CrossProcessSemaphore::Wait().

Other than these, I don't see anything else too unusual.
Blocks: FastReflows
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(bas)
maybe run a profile of esr45 to see if it can repro the huge difference?
Does the profiler work on esr45?
You won't get C++ stacks from the official esr45 builds so it'll not be as useful. But it'll work if you compile the esr45 source as "nightly".
I'd be interested in the profile comparison too.

CrossProcessSemaphore::Wait() should just be an alternative blocking point (we previously blocked in a sync IPDL call when sending a layers transaction, now we block for the compositor to have finished with the transaction when we try paint next). I'd expect this to just move blocking time from one location to the other, and hopefully be a shorter wait.

If the semaphore time is actually worse, then we might have an actual regression where we're not releasing read locks when we should.
Flags: needinfo?(matt.woodrow)
Attached image cpu usage graphs
I am not particularly keen to build esr45 locally, so I took a profile with xperf.

At first glance, e10s seems to be slowing us down a lot, compared to nightly without e10s.
Does the Gecko Profiler not work in ESR45?
(In reply to Jonathan Watt [:jwatt] from comment #17)
> Does the Gecko Profiler not work in ESR45?

I wouldn't count on it.
(In reply to :Ehsan Akhgari (busy) from comment #11)
> The reflow times may be worth looking into (no idea whether they are unusual
> or not.)  The paint times also seem really high, but it seems like they're
> all in some kind of JIT code?  And a lot in CrossProcessSemaphore::Wait().
> Other than these, I don't see anything else too unusual.

During the one particularly long rasterize in the profile posted by David. The paint times just seem to be in D3D/D2D code, but nothing that makes it particularly obvious what kind of work is being done. From what I can tell it seems like the GPU was busy during that draw. CrossProcessSemaphore::Wait() can happen when the GPU is busy as well.

What I've seen for higher CPU usage when the GPU is busy, is related to the mQuery spin at the end of EndFrame, I have some patches locally to reduce that code's CPU footprint. However it should be noted they don't affect user facing performance (except maybe of other applications than Firefox)
Flags: needinfo?(bas)
Whiteboard: [qf:investigate] → [qf:p3]
Keywords: perf
You need to log in before you can comment on or make changes to this bug.