Open Bug 1467697 Opened 6 years ago Updated 2 years ago

Loading page is x2 slower than chrome

Categories

(Core :: JavaScript: GC, defect, P2)

20 Branch
x86_64
Windows 10
defect

Tracking

Performance Impact medium
Tracking Status
firefox62 --- affected
firefox63 --- affected
firefox64 --- affected

People

(Reporter: alice0775, Unassigned)

References

(Depends on 3 open bugs)

Details

(Keywords: parity-chrome, perf, perf:pageload)

Attachments

(1 file)

STR:
Open http://www.elegantreader.com/item/14804993

Actual Results:
It takes about 30 seconds to complete loading and rendering on Nightly 62.0a1.
In contrast, it takes less than 10 seconds on Chrome 68.

Gecko Profiler log : https://perfht.ml/2JkpiII

Expected Results:
Faster ;)
Attached file about:support
Gecko Profiler log : https://perfht.ml/2JkB2uD
https://perfht.ml/2xWVv3C
appears to be a JS bug.
Flags: needinfo?(jdemooij)
Does it get better if you set javascript.options.asyncstack to false?

It takes ~5 seconds for me to load that page, definitely not 30.
Flags: needinfo?(jdemooij)
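(For anyone re-testing: that pref can be flipped in about:config, or set persistently via a user.js file in the profile directory; a minimal sketch:)

  // user.js in the Firefox profile directory
  user_pref("javascript.options.asyncstack", false);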
(In reply to Jan de Mooij [:jandem] from comment #4)
> Is it better if you set javascript.options.asyncstack to false?

Nothing seems to change.

> 
> It takes ~5 seconds for me to load that page, definitely not 30.

It seems a slow network is needed in order to reproduce Firefox's slowness.
My environment is ADSL, 5.0 Mb/s down / 0.6 Mb/s up (measured with http://www.bspeedtest.jp).
Hm but in that case it might be more of a networking issue...
Component: General → Networking
None of these profiles have any networking stuff showing up in them, from my look at things (profiles don't even show the socket thread, for example). Add to that, I can't reproduce (even on my connection that is almost as slow as the reporter's), so I can't collect data locally.

I will note that in the network markers tab, none of the loads takes particularly long. I don't really know where to go from here, but there is zero evidence that this is a necko issue.
Component: Networking → General
A more noteworthy regression range than the above:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=2e70b718903a&tochange=bfd85c9652fa
Version: 12 Branch → 20 Branch
Let's see if Quantum Flow triage turns anything up here.

(Note: I'd like to remove the regression keyword since a 6-year-old regression isn't something we can easily back out :)
Whiteboard: [qf]
There is some really weird GC stuff going on here.  The execution of JS is being repeatedly interrupted by GcSlices with FULL_SLOT_BUFFER as a reason.  Also the main thread is spending a lot of its time just waiting around for another thread to wake it up, which suggests synchronous waits on _something_ (not sure what exactly).

Paul, can you make sense of what's going on with the GC slices here?  Are we running into some pathological case where we keep running GC slices that don't do much work?
Flags: needinfo?(pbone)
I can reproduce that this is slower in Firefox than chrome.  At least twice as slow.

There does seem to be a lot of GC activity; that's not always a bad thing but it's probably a big clue in this case.

TLDR
====

We need to test this without nursery poisoning to see if it's still a real problem and how big the problem is.

First problem
=============

In a nightly build of Firefox we poison the nursery each time we collect it; this is showing up in profiles and accounts for (eyeballing the profile) about half of the nursery collection time.  It will also distort other parts of the profile.  I verified this by testing in the release version of Firefox and found that the page was faster there (but probably still with room for improvement).

Alice0775, you may wish to re-test with a release or beta version of Firefox.

If you control the site elegantreader.com, could you also use Performance.mark / Performance.measure to annotate this part of the loading time?  Then we could see these in profiles and get an accurate measurement of the loading time.  I'm not sure whether you can measure from page load start.
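(A minimal sketch of what that could look like, with made-up marker names and a hypothetical loadComments() standing in for the site's own code; the resulting marks and measures show up as markers in Gecko profiles:)

    // Hypothetical instrumentation around the slow phase of the page:
    performance.mark("comments-load-start");
    loadComments().then(() => {
        performance.mark("comments-load-end");
        performance.measure("comments-load",
                            "comments-load-start", "comments-load-end");
    });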

I will file a separate bug to allow people testing Nightly to control the poisoning behaviour so that they can gather profiles where this doesn't affect performance.

Next problem
============

All the provided profiles, plus my own profile, show lots of minor GC activity, particularly due to full remembered sets.  (This is partly why the poisoning shows up; it wouldn't if the nursery wasn't so busy.)  The tenure rate is also high-ish.

My guess is that there are lots of tenured->nursery pointers being created, potentially due to pre-tenuring of objects that shouldn't be pre-tenured, and that this is making generational GC less effective, making us collect the nursery more frequently and keeping things alive in the nursery longer than they should be (which means slower nursery collections).

What we'd like to see (from the GC) for a program that creates this much GC activity is the same amount (or slightly less) of nursery collection which tenures very few of the objects, meaning the nursery collection is much faster.  But first we'd have to take a deeper look at some profiles without the poisoning, so that's a later step.
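(For anyone following along who is less familiar with the terminology: a tenured->nursery pointer is created whenever a freshly allocated (nursery) object is stored into a slot of a long-lived (tenured) object.  The write barrier records that slot in the slots/store buffer, and when the buffer fills up we run a minor GC with the FULL_SLOT_BUFFER reason.  A hypothetical JS-level sketch of the pattern, with made-up names:)

    // "cache" has survived several minor GCs, so it lives in the tenured heap.
    const cache = { latest: null };

    function onUpdate(data) {
        // The new promise is allocated in the nursery; writing it into a slot
        // of a tenured object adds an entry to the slots (store) buffer.
        cache.latest = Promise.resolve(data);
    }

    // Doing this many thousands of times quickly fills the slots buffer and
    // forces frequent minor GCs (reason: FULL_SLOT_BUFFER).
    for (let i = 0; i < 100000; i++) onUpdate(i);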

Misc
====

The provided profiles show a lot of page faults but mine doesn't.  But I can still reproduce the slowness in Firefox, so this is either unrelated, or Windows (where the other profiles were generated) is more honest about reporting page faults (or we don't collect that on Linux).

Kannan ITYM GCMinor not GCSlice, but let me know if I'm mistaken and missed something.

Also Kannan, does a minor collection cause (always cause?) a bailout from the JIT?  Do you lose the JITed code and have to re-compile it?  Is it a problem or can it be avoided?

I agree with Overholt, the regression ranges aren't all that useful.  There are two GC-related changes there; both of them fix crashes/bugs.  And so much will have changed algorithmically since then that I doubt there's anything useful there.
Component: General → JavaScript: GC
Flags: needinfo?(pbone) → needinfo?(kvijayan)
Keywords: regression
Depends on: 1471467
(In reply to Alice0775 White from comment #9)
> More remarkable regression range than the above:
> https://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=2e70b718903a&tochange=bfd85c9652fa

FWIW, via local build, I finally got https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=fd8fd1a7aecd&tochange=9de611848111
Interesting.

NI Olli, I know this change is from 6 years ago, but can you think of why it could make the GC behave in this way?  Could it create a large number of tenured->nursery pointers, or even re-write a single tenured->nursery pointer many times?

I'll have a scan through the patch myself, but this isn't code I'm familiar with.

Thanks.
Flags: needinfo?(bugs)
Here are some times I've taken, with the Gecko profiler set to sample every 0.4ms to increase the accuracy of catching the poisoning.  Note that measurement is difficult since I don't have a great way to time things like this without Performance.mark/measure.

Nightly: 11sec
Nightly no poison: 12sec

Alice0775 could you also test to see if this is still a problem when nursery poisoning is disabled.  Please add JSGC_DISABLE_POISONING=1 to firefox's environment.  For example on Linux I run:

  $ JSGC_DISABLE_POISONING=1 ./firefox

If you can also provide the profile for this execution I can check to see if it truly disabled poisoning.

Please also re-test in a release version of firefox and see if it's the same.

If it's the same in both cases then this is something we'll want to investigate further.
Flags: needinfo?(alice0775)

(In reply to Paul Bone [:pbone] from comment #15)
> 
> Alice0775 could you also test to see if this is still a problem when nursery
> poisoning is disabled.  Please add JSGC_DISABLE_POISONING=1 to firefox's
> environment.  For example on Linux I run:
> 
>   $ JSGC_DISABLE_POISONING=1 ./firefox
> 

Step:
Open command prompt
cd (path to install directory)
set JSGC_DISABLE_POISONING=1
.\firefox.exe

The time it takes to display the page seems to be almost unchanged.

> If you can also provide the profile for this execution I can check to see if
> it truly disabled poisoning.
> 
> Please also re-test in a release version of firefox and see if it's the same.
> 
> If it's the same in both cases then this is something we'll want to
> investigate further.

Nightly62.0a1 (20180627222831) approx. 35sec https://perfht.ml/2MvpVMM
Firefox61.0.0 (20180621125625) approx. 20sec https://perfht.ml/2Kq2SlJ
Flags: needinfo?(alice0775)
I don't know why making MutationObserver to use webidl bindings would cause this.

Is the site perhaps using MutationObservers as microtask runners in the old builds and Promises in the new builds?  (We now have correct Promise scheduling in FF; Promises also use microtasks.)
Based on the profile the page is using crazy amounts of promises and message events.
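(For context, the classic MutationObserver-as-microtask-runner trick versus the Promise equivalent; a hypothetical sketch of what libraries did before Promise scheduling was reliable:)

    // Old trick: get a microtask by observing a mutation on a dummy node.
    function microtaskViaMutationObserver(callback) {
        const node = document.createTextNode("");
        new MutationObserver(callback).observe(node, { characterData: true });
        node.data = "x";  // the observer callback runs as a microtask
    }

    // Modern equivalent: Promise reactions are also scheduled as microtasks.
    function microtaskViaPromise(callback) {
        Promise.resolve().then(callback);
    }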

And there's something weird with the JS.  If you look at those spikes in the profile, they are very, very deep JS call stacks: af, bf, fc, ...
Flags: needinfo?(bugs)
(In reply to Alice0775 White from comment #16)
> 
> (In reply to Paul Bone [:pbone] from comment #15)
> > 
> > Alice0775 could you also test to see if this is still a problem when nursery
> > poisoning is disabled.  Please add JSGC_DISABLE_POISONING=1 to firefox's
> > environment.  For example on Linux I run:
> > 
> >   $ JSGC_DISABLE_POISONING=1 ./firefox
> > 
> 
> Step:
> Open command prompt
> cd (path to install directory)
> set JSGC_DISABLE_POISONING=1
> .\firefox.exe
> 
> The time it takes to display the page seems to be almost unchanged.

Thanks.

> > If you can also provide the profile for this execution I can check to see if
> > it truly disabled poisoning.
> > 
> > Please also re-test in a release version of firefox and see if it's the same.
> > 
> > If it's the same in both cases then this is something we'll want to
> > investigate further.
> 
> Nightly62.0a1 (20180627222831) approx. 35sec https://perfht.ml/2MvpVMM
> Firefox61.0.0 (20180621125625) approx. 20sec https://perfht.ml/2Kq2SlJ

The difference here may be due to other things we do in Nightly builds that help to catch problems.  They're definitely having a disproportionate effect on this particular page though.

There's still a large difference with Chrome (your reported timings) and so we should continue to investigate.
(In reply to Olli Pettay [:smaug] from comment #17)
> I don't know why making MutationObserver to use webidl bindings would cause
> this.
> 
> Is the site perhaps using MutationObservers as microtask runners in the old
> builds and Promises in the new builds (we have now correct Promise
> scheduling in FF, they are also using microtasks).
> Based on the profile the page is using crazy amounts of promises and message
> events.

That's worth a look, I'll investigate there.

> And something weird with the js. If you look at those spikes in the profile,
> they are very very deep JS callstacks, af, bf, fc, ...

I saw that but wasn't sure if it was typical for JS, either in the past or for "modern JS".  Since you say it's abnormal, it could be worth looking there too.

Thanks.
My best guess is that something on the tenured heap has a lot of nursery pointers, and for some reason this causes us to perform worse than Chrome.  Perhaps it only affects us because it's a pretenuring decision we make that Chrome doesn't.  Another theory is that in the deep recursions we're keeping more stack pointers alive that are holding more things live and causing the high number of nursery promotions.

I will (mostly notes for myself):
 * Check the behaviour with and without nursery collection and pretenuring.
 * Try to debug what the tenured->nursery pointers are.
 * Check how many collections there are for each type of collection reason / how many items in each remembered set.
 * Teach the profiler to calculate evicts-per-second and evicted-bytes-per-second (might as well do mark/sweep for major 
   collections also).  Verify that this test is on par with other tests according to these stats.
Assignee: nobody → pbone
Status: NEW → ASSIGNED
Summary: Loading(rendering) page is x3 slower than chrome → Loading page is x3 slower than chrome
Whiteboard: [qf] → [qf:p1:f64]
Priority: -- → P1
Depends on: 1473213
Maybe the problem here is that we're creating too many Promise objects and we spend too much time GCing them...  Should we consider allocating those objects in the nursery heap?  (See bug 1377500, which tracks some other similar issues.)
That's something I looked for, and as far as I could tell Promises are nursery allocated, but I was only about 85% sure.  I don't know exactly where in the code to check.  If I'm wrong then yes, that is likely to be at least part of the problem.
One thing I investigated is pretenuring.  I suspected that something was being pre-tenured that shouldn't be, and that this was filling the remembered sets.  This doesn't seem to be the case; disabling pretenuring didn't make a difference.

I'm currently writing some code to determine what kinds of objects are involved in tenured->nursery pointers.  Then I can decide if some of those could be handled differently.
Hi Alice, Do you have any contact with the webmaster for this site?  I'm trying to find out if this behaviour is typical for vue.js sites or this is somehow something specific the JS on this site does.
Flags: needinfo?(alice0775)
(In reply to Paul Bone [:pbone] from comment #24)
> Hi Alice, Do you have any contact with the webmaster for this site?

No, I did not.
Flags: needinfo?(alice0775)
It looks as if, while the Promises are allocated in the nursery, perhaps the PromiseReactionRecord objects are not.  These have references to the Promise and also to functions, which are often in the nursery.  The next questions are: are PromiseReactionRecords allocated in the nursery and later evicted, or are they always allocated in the tenured heap, and why?  Then, will allocating them in the nursery speed up this page?
So at least some of the PromiseReactionRecord objects are being nursery allocated.  So they're being tenured because of something else, but I'm not sure what yet.  I'm not yet sure that they're the problem.
Depends on: heapprof
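(Background on why the Promise and PromiseReactionRecord counts below track each other so closely: every .then()/.catch() call allocates a reaction record holding the promise and its callbacks, so promise-heavy code creates one of each per step.  A hypothetical sketch, assuming a made-up fetchItem() helper:)

    // Each .then() allocates a reaction record internally, so a chain like
    // this creates one Promise plus one PromiseReactionRecord per item.
    function fetchAll(ids) {
        return ids.reduce(
            (chain, id) => chain.then(() => fetchItem(id)),
            Promise.resolve());
    }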
So the Promise and PromiseReactionRecords are the most common tenured objects to appear in a slots buffer (they're in the tenured heap, they have some slots that were assigned nursery pointers).  

                        val
class name                 
Array                   702
Object                   54
Promise                1159
PromiseReactionRecord  1159

The most common object to appear in the slots buffer is an Object, created at offset 29 (rows 3-5).

                                                                                      val
class name alloc site script                                   class name obj            
Object     ?:0        function Ta(t){var e,n={};for(e in t)n[e Object                   1
           ?:117      function(t,e,n,r){var i=t.ja(),o=t.path. Object                  16
           ?:29       function(t){if(this.e())return null;var  Array                  177
                                                               Promise                190
                                                               PromiseReactionRecord  183

The counts for Array, Promise and PromiseReactionRecord are the counts of tenured objects that point to this nursery object when the slots buffer is scanned.  I excluded already forwarded nursery objects in this table.

It looks as though this object is created in the code (prettified):

    g.H = function(t) {
        return t && !this.C().e() ? {
            ".value": this.Ca(),
            ".priority": this.C().H()
        } : this.Ca()
    }

of http://www.elegantreader.com/dist/vendor.e0b86d19455fa88b4013.js

We could try getting this object to be pretenured.  But I'm unsure if that would just shift the problem onto something it points to.  I already tried disabling the nursery (same effect as pretenuring everything!) but there was no improvement.

NI: jonco, sfink in case this creates any insight for you.
Flags: needinfo?(sphink)
Flags: needinfo?(jcoppeard)
I tried reducing the tenure rate needed for pretenuring.  Previously the nursery would be collected often due to full slot buffer, but now it has various other reasons (and the tenure rate seems to be a bit lower).

It didn't improve performance noticeably.

I can only explain such a difference with Chrome as either 1) some constant-factor algorithm is just faster in Chrome, or 2) there's something we can't allocate in the nursery (currently) that's pointing indirectly at all these other things, causing them to be tenured; maybe Chrome is able to nursery-allocate more things.

I wish there was something like perf.html for Chrome, then I could get an idea of what their GC does with this example ;-)
In Chrome devtools->[three dots]->more tools->Javascript profiler ?
And there is the Performance tool too.
Thanks smaug, I took a look but nothing obvious is there that'll give the kind of clues that are really needed.

The profiles in both browsers look similar, but this looked a little odd:

 + Chrome spent between 5-6% of its time doing minor GC
 + Firefox spent between 3-4%.

The Firefox heap is MUCH bigger than Chrome's.  I even tried a new browser instance for both.
 + Chrome: 20-30MB, sometimes just over 100MB at the end of the page load, drops quickly to <10MB.
 + Firefox: ~260MB

It's likely that they're counting different things in this reporting, i.e. that it's a difference between the profilers.  However, a larger heap means more work per GC but fewer GCs (it's not clear which effect dominates here).

I'm not sure what can be done next; maybe general work (e.g. improving tenuring performance) would help move the needle on this page.  I'm unassigning myself because I'm out of ideas/things to test and I think I can make better progress elsewhere for now.
Status: ASSIGNED → NEW
Lowering priority after chatting with sdetar last week about this.
Priority: P1 → P2
After discussions with Paul on this, at this point this bug is somewhat stalled and not feasible for FF64.  We are moving the QF target to [qf:p1:f67]; this is something we still need to work on, just not for the QF 64 release.
Whiteboard: [qf:p1:f64] → [qf:p1:f67]
As requested, an ni? for Paul; I just want him to be aware of Bug 1500920, which -may- have an impact on this, as the JIT code was checking the pretenured flag incorrectly and as a result could have created tenured objects more often than expected based on setShouldPretenure.
Flags: needinfo?(pbone)
Whiteboard: [qf:p1:f67] → [qf:p1]
This is now a lot faster.  I want to keep the bug open a little longer because there's still a lot of full slot buffer GCs here.  And that's code I'm modifying at the moment.
Flags: needinfo?(sphink)
Flags: needinfo?(pbone)
Flags: needinfo?(jcoppeard)
(In reply to Paul Bone [:pbone] from comment #35)
> This is now a lot faster.  I want to keep the bug open a little longer
> because there's still a lot of full slot buffer GCs here.  And that's code
> I'm modifying at the moment.

I retested this on latest Nightly, but nothing improved.
Build ID 	20181115100051
User Agent 	Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0

It takes around 30s on Nightly65.0a1, but 10s on Firefox19 and 8s on Chrome Dev72....
Weird,

Firefox Nightly 20181115223444: 14sec
Chrome 70.0.3538.77: 9sec

Still not as fast as it should be, but it used to be 20+seconds for me in Firefox.

I see that yours is much slower. I don't know why, maybe Windows vs Linux is a factor.

Thanks.
The page is also rather slow on my Windows 10 x64 PC on the latest Nightly, therefore I also created a profile.
Maybe it's helpful for someone:

The site feels slow to me because the waiting spinner behind the "513 comments" text spins for ~15 seconds.

https://perfht.ml/2PtHP8B
Depends on: 1505624
Flags: needinfo?(kvijayan)
This page uses tons of promises, so anything under bug 1342037 would probably help.
Whiteboard: [qf:p1] → [qf:p1:pageload]
Depends on: 1342037

This is still a really bad problem. It's not 30s, but still takes around 10s+ while Chrome is around 4.

https://share.firefox.dev/3dEg7R0

That profile is awful from a responsiveness perspective. It's allocating fast enough that the GC keeps going nonincremental to catch up.

It also has frequent long delays while it flushes the microtask queue.

For the GC stuff, it would probably be better to slow the mutator down and shift time towards the GC until it can keep up. But that's not going to help the 10s load, and might even make it worse.

It seems to do lots of FETCH_ITEMS calls (within a promise callback) without ever yielding. There's a fair amount of network activity, though not all that much, going on throughout the whole run.
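(Not something this bug can fix, but for illustration: the usual page-side mitigation is to break such a promise-driven loop into batches that yield back to the event loop, giving paints and incremental GC slices a chance to run.  A hypothetical sketch, again assuming a made-up fetchItem() helper:)

    // Hypothetical batching: process items in chunks and yield between chunks
    // instead of queueing thousands of back-to-back microtasks.
    async function fetchItemsInBatches(ids, batchSize = 50) {
        for (let i = 0; i < ids.length; i += batchSize) {
            await Promise.all(ids.slice(i, i + batchSize).map(fetchItem));
            // setTimeout(0) yields as a task (not a microtask), so the browser
            // can paint and run GC slices before the next batch starts.
            await new Promise(resolve => setTimeout(resolve, 0));
        }
    }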

Adding some metrics here for reference:

ContentfulSpeedIndex:  Chrome:  2580.00 (+- 109) Firefox:  4388.33 (+- 190) ,  -70.09%
PageLoadTime:          Chrome:  1942.33 (+- 259) Firefox:  3476.44 (+- 889) ,  -78.98%

Gonna move this to p2 since it's not a critical website and fixing this would probably require a large architectural change.

Whiteboard: [qf:p1:pageload] → [qf:p2:pageload]
Summary: Loading page is x3 slower than chrome → Loading page is x2 slower than chrome
Assignee: pbone → nobody
Performance Impact: --- → P2
Keywords: perf:pageload
Whiteboard: [qf:p2:pageload]
Severity: normal → S3