Open Bug 1358899 Opened 7 years ago Updated 1 year ago

Opening a link from an external application while Firefox is idle is slow

Categories

(Core :: JavaScript: GC, defect, P4)

defect

Tracking

()

Performance Impact none

People

(Reporter: florian, Unassigned)

References

(Depends on 2 open bugs, Blocks 1 open bug)

Details

(Keywords: perf)

I just clicked a link in Telegram, and somehow that link opened in a Nightly instance I had forgotten in the background rather than in the Firefox instance I was currently using. This felt really slow, I had time to see the new tab opening (it stayed empty long enough for me to notice before the page started loading). I had the profiler running, so I captured the profile: https://perfht.ml/2pTH3nu

In this profile, we spend more than 800ms (on a super fast macbook) between when the OS raises the Firefox window, and when we start actually loading the new URL in a content process.

Here are a few things I noticed in the profile:
- when the window is raised, we bring the focus to it, and spent time doing at least 2 layout flushes. One was triggered by nsWebShellWindow::WindowActivated() and one was due to focusing the URL bar in the existing tab, before we start opening a new tab (bluring it causes a sync flush, and selecting the new tab too). Could we create the new tab before the window is raised?
- we wait too long to start preparing a content process (I apparently didn't have one yet; the only existing tab in that instance was about:sessionrestore loaded in the chrome process)
- 135ms of the 800ms are spent GCing; and it's being done at the most inconvenient time, from the onxbltransitionend event handler (that's what caused the tab to stay empty for long enough for me to notice at the end of the CSS transition when opening it). I suspect this long GC may be related to the process having been idle in the background with throttled events for a while.

Ehsan do you know if we already have a bug on file to avoid this GC behavior? Also, this profile could do with platform eyes looking at it.
Flags: needinfo?(ehsan)
Flags: qe-verify?
Priority: -- → P2
(In reply to Florian Quèze [:florian] [:flo] from comment #0)
> I just clicked a link in Telegram, and somehow that link opened in a Nightly
> instance I had forgotten in the background rather than in the Firefox
> instance I was currently using. This felt really slow, I had time to see the
> new tab opening (it stayed empty long enough for me to notice before the
> page started loading). I had the profiler running, so I captured the
> profile: https://perfht.ml/2pTH3nu
> 
> In this profile, we spend more than 800ms (on a super fast macbook) between
> when the OS raises the Firefox window, and when we start actually loading
> the new URL in a content process.
> 
> Here are a few things I noticed in the profile:
> - when the window is raised, we bring the focus to it, and spent time doing
> at least 2 layout flushes. One was triggered by
> nsWebShellWindow::WindowActivated() and one was due to focusing the URL bar
> in the existing tab, before we start opening a new tab (bluring it causes a
> sync flush, and selecting the new tab too). Could we create the new tab
> before the window is raised?

Some of this is bug 1334642 (around 18ms).  Around 29ms is some add-on SDK code that emits events on tabbrowser (https://perfht.ml/2pdn5Du).  I never really understood why this is needed and what this actually does.  I'm 100% sure that Kris on some bug looked into the what and why, but I can't find it right now.  Kris, do you remmeber?  My preference is to kill this code if at all possible, this keeps coming up in profiles all the time!

Another interesting question is why we are flushing layout here, and not frames <http://searchfox.org/mozilla-central/rev/313e5199bf58200f158c6fcbe193e41b88ed58a6/dom/base/nsFocusManager.cpp#1552>.  This is taking a long time (89ms).  Neil, is flushing layout really necessary here?

> - we wait too long to start preparing a content process (I apparently didn't
> have one yet; the only existing tab in that instance was
> about:sessionrestore loaded in the chrome process)

Bug 1349840.

> - 135ms of the 800ms are spent GCing; and it's being done at the most
> inconvenient time, from the onxbltransitionend event handler (that's what
> caused the tab to stay empty for long enough for me to notice at the end of
> the CSS transition when opening it). I suspect this long GC may be related
> to the process having been idle in the background with throttled events for
> a while.

Kannan may be interested to look at the GC.  Right now our GC story is pretty sad, this kind of thing is very normal to see.  We may choose to GC at any inconvenient time for any lengthy period of time.

> Ehsan do you know if we already have a bug on file to avoid this GC
> behavior? Also, this profile could do with platform eyes looking at it.

We have bug 1352524 on file...
Flags: needinfo?(kvijayan)
Flags: needinfo?(kmaglione+bmo)
Flags: needinfo?(enndeakin)
Flags: needinfo?(ehsan)
What's interesting about this GC profile is spending ~80ms in Nursery::collect() <https://perfht.ml/2pdtOgP>.  Then after we're done doing that we go ahead and spend 53ms in GCRuntime::beginMarkPhase() <https://perfht.ml/2pdIUmt>, first 23ms in CancelOffThreadIonCompile() calling the dreaded FinishOffThreadBuilder() and once we're done doing that we go and spend another 28ms calling Zone::discardJitCode().  The summary is that we are not very fast at releasing memory.  This all happens when we're parsing the global scope of tab-content.js.  Ironically as far as I can tell we only take about 3ms actually in the RunScript running this global scope itself, the rest is pure GC overhead.
This bug deserves to be in the GC component.  :-)
Component: General → JavaScript: GC
Product: Firefox → Core
Is the GC-being-slow reproduceable here? (doesn't make too much sense to make random bug depend on GC scheduling).
(In reply to :Ehsan Akhgari (super long backlog, slow to respond, not reviewing for a while) from comment #1)

> Another interesting question is why we are flushing layout here, and not
> frames
> <http://searchfox.org/mozilla-central/rev/
> 313e5199bf58200f158c6fcbe193e41b88ed58a6/dom/base/nsFocusManager.cpp#1552>. 
> This is taking a long time (89ms).  Neil, is flushing layout really
> necessary here?

That's a question for bug 1356034.
(In reply to Olli Pettay [:smaug] from comment #4)
> Is the GC-being-slow reproduceable here? (doesn't make too much sense to
> make random bug depend on GC scheduling).

I think the point here is to ensure GC doesn't happen in the middle of a tab animation. I don't know if it's reproduceable.
(In reply to Olli Pettay [:smaug] from comment #4)
> Is the GC-being-slow reproduceable here? (doesn't make too much sense to
> make random bug depend on GC scheduling).

Kannan had asked me to send him slow GC profiles...
(And these GC issues are almost never reproducible BTW.)
(In reply to :Ehsan Akhgari (super long backlog, slow to respond, not reviewing for a while) from comment #1)
> Some of this is bug 1334642 (around 18ms).  Around 29ms is some add-on SDK
> code that emits events on tabbrowser (https://perfht.ml/2pdn5Du).  I never
> really understood why this is needed and what this actually does.  I'm 100%
> sure that Kris on some bug looked into the what and why, but I can't find it
> right now.  Kris, do you remmeber?  My preference is to kill this code if at
> all possible, this keeps coming up in profiles all the time!

It's updating the state of the profiler's toolbar button. Basically, of the
8ms this code takes, 7ms is spent generating a state object, updating the
existing state object, diffing that against he last state object to see what's
changed, generating a new state object for the button based on those changes,
diffing that against the current state to see what we need to change, and then
changing it to match that current state.

I wish we could get rid of it... but we can't until we get rid of the SDK. On
the plus side, it should disappear from profiles once the profiler migrates to
a WebExtension.
Flags: needinfo?(kmaglione+bmo)
(In reply to :Ehsan Akhgari (super long backlog, slow to respond, not reviewing for a while) from comment #1)
> > - we wait too long to start preparing a content process (I apparently didn't
> > have one yet; the only existing tab in that instance was
> > about:sessionrestore loaded in the chrome process)
>
> Bug 1349840.

See also bug 1341008. I believe that should fix this part of the problem
problem, and I think Gabor is working on getting it landed this week.
See Also: → 1341008
This profile contains a 78ms nursery collection.  That seems crazy super long to me.

21ms tracing the store buffer.  27ms doing various sweeping activity. Bunch of other smaller times doing other things.
Flags: needinfo?(kvijayan)
Note, currently Nightly has some super bad leak regression. Bug 1357872
(In reply to Kris Maglione [:kmag] from comment #10)
> (In reply to :Ehsan Akhgari (super long backlog, slow to respond, not
> reviewing for a while) from comment #1)
> > > - we wait too long to start preparing a content process (I apparently didn't
> > > have one yet; the only existing tab in that instance was
> > > about:sessionrestore loaded in the chrome process)
> >
> > Bug 1349840.
> 
> See also bug 1341008. I believe that should fix this part of the problem
> problem, and I think Gabor is working on getting it landed this week.

I'm also adding a probe for measuring this kind of delays in bug 1352961. There are some delays there too because we don't have the infrastructure to measure time difference across process boundaries in JS... I think the real perf. killer is that we have to load/compile/execute a bunch of frame/process scripts before we start to listen to the message that would trigger the first navigation (and as it seems this can even trigger some GC work just for the fun of it).

There are two code paths. Either we start the tab with a remote browser and
then the RemoteWebNavigation will send the request. Or we start with a non-remote
browser and have to change the remoteness flag on it, and then the SessionStore
will send the request.

Anyway, I hope bug 1341008 will mitigate this problem.
(In reply to Kannan Vijayan [:djvj] from comment #11)
> This profile contains a 78ms nursery collection.  That seems crazy super
> long to me.

That sounds like bug 1298816, which would mean it's profiler overhead. I just landed that bug. If it sticks, it would be interesting to see if it helps here. The numbers I got in that bug weren't as good as I hoped.

> 21ms tracing the store buffer.  27ms doing various sweeping activity. Bunch
> of other smaller times doing other things.

...though on the other hand, I wouldn't expect bug 1298816 to cause a slow store buffer. I could be wrong?
Flags: needinfo?(enndeakin)
Whiteboard: [photon-performance][qf] → [photon-performance][qf-]
It's possible that the long running GC here would have been helped if bug 1351790 was fixed.
Depends on: 1351790
Whiteboard: [photon-performance][qf-] → [photon-performance][qf]
Whiteboard: [photon-performance][qf] → [photon-performance][qf-]
Priority: P2 → P3
Whiteboard: [photon-performance][qf-] → [reserve-photon-performance][qf-]
Could he process have been swapped out since it was in the background and had been unused?
This would cause slowness in GC wince GC reads a fair amount of memory.
Flags: qe-verify? → qe-verify-
Priority: P3 → P4
Keywords: perf
(In reply to Paul Bone (PTO) [:pbone] from comment #16)
> Could he process have been swapped out since it was in the background and
> had been unused?
> This would cause slowness in GC wince GC reads a fair amount of memory.

I get into states where I have several seconds of disk IO (2-15) to my laptop SSD before the page opens.
Whiteboard: [reserve-photon-performance][qf-] → [fxperf][qf-]
Not actionable for the front-end performance team.
Whiteboard: [fxperf][qf-] → [qf-]
Performance Impact: --- → -
Whiteboard: [qf-]
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.