Closed Bug 1184569 Opened 9 years ago Closed 2 years ago

Firefox janks hard when loading a Google Docs presentation

Categories

(Core :: JavaScript Engine, defect)

Tracking

RESOLVED WORKSFORME
Performance Impact low
Tracking Status
platform-rel --- +

People

(Reporter: mconley, Unassigned)

References

(Blocks 2 open bugs)

Details

(Keywords: perf, Whiteboard: [platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides])

If this is still happening we probably need a new profile to figure out what's going on. :-(
Component: General → Untriaged
Hi,

The issue is still reproducible on the latest Nightly (48.0a1, Build ID 20160310030242) on Windows 10 x64 with e10s disabled.
Ran a gecko profile without e10s:
https://cleopatra.io/#report=1457ea5b8993006a5cb46d721f86e391050e59cc&selection=0,1,2794,3213,3214,3215,2568,4,6,4,3204
And one with e10s:
https://cleopatra.io/#report=82fbbb82eea05b904c05abaaeb1777b2799fca90&selection=0,124,125,126,127,128,129,130,1,131,132,133,134,135,136,137,138,284,139,285,286,287,288,289,290,291
But I'm not sure what component this could be put in. Any suggestions Gijs?
Flags: needinfo?(gijskruitbosch+bugs)
OS: Unspecified → All
Hardware: Unspecified → All
(In reply to Ciprian Muresan [:cmuresan] from comment #2)
> Hi,
> 
> The issue is still reproducible on the latest Nightly (48.0a1, Build ID
> 20160310030242) on Windows 10 x64 with e10s disabled.
> Ran a gecko profile without e10s:
> https://cleopatra.io/
> #report=1457ea5b8993006a5cb46d721f86e391050e59cc&selection=0,1,2794,3213,
> 3214,3215,2568,4,6,4,3204
> And one with e10s:
> https://cleopatra.io/
> #report=82fbbb82eea05b904c05abaaeb1777b2799fca90&selection=0,124,125,126,127,
> 128,129,130,1,131,132,133,134,135,136,137,138,284,139,285,286,287,288,289,
> 290,291
> But I'm not sure what component this could be put in. Any suggestions Gijs?

The profile looks awful - 25% of the total time of the profile is spent in JS land for the page in question, but it's not immediately clear to me why that's so slow and janky, and why we seem to be spending a fraction of that time on this stuff in e10s mode. Benoit or Mike, can you help find a component / owner for this?
Flags: needinfo?(mconley)
Flags: needinfo?(gijskruitbosch+bugs)
Flags: needinfo?(bgirard)
Yes, in the non e10s profile we're spending too much time in JS. I see a lot of bailout notifications so perhaps there's some tweaking we could do in the JS engine.

In the e10s profile I see a lot of costly 'IPDL::PContent::SendRpcMessage' sync IPC messages and not so much JS execution.
Flags: needinfo?(bgirard)
Those SendRpcMessage's are due to an observer notification being proxied to the parent synchronously over the add-on shims.

This means that there's an add-on installed that's not marked multi-process compatible, and it's janking the browser pretty hard.
Flags: needinfo?(mconley)
Component: Untriaged → JavaScript Engine
Product: Firefox → Core
mconley, from comment 5 it sounds like this is not really a JS Engine bug. Can you re-assign or WONTFIX?

cmuresan, does disabling addons make the problem go away?
Flags: needinfo?(mconley)
Flags: needinfo?(ciprian.muresan)
Yes, assuming disabling add-ons makes this problem go away, we'll need a full list of enabled add-ons.
Component: JavaScript Engine → Add-ons
Flags: needinfo?(mconley)
Product: Core → Tech Evangelism
The issue is reproducible with e10s disabled and with addons disabled. The jank is not as apparent with e10s enabled. I put it in JS engine based on comment 4, because the user story specifically asks for e10s to be disabled.
Component: Add-ons → JavaScript Engine
Flags: needinfo?(ciprian.muresan)
Product: Tech Evangelism → Core
Fair enough. Thanks.
Whiteboard: [platform-rel-Google][platform-rel-GoogleDocs]
platform-rel: --- → ?
platform-rel: ? → +
Flags: needinfo?(bkelly)
nbp, you've had some great results looking into bailout issues, so I'm going to ni? you on this. Thanks!
Flags: needinfo?(nicolas.b.pierron)
Here's a new profile:

https://cleopatra.io/#report=3cf1cb633f1aab98158736804a6fb156eb771f34

From what I can tell we spend:

* 2.7 seconds in nsHtml5TreeOpExecutor::RunFlushLoop()
* 1.8 seconds in an XHR's OnStopRequest dispatching an event
* About 2.8 seconds in large timeout callbacks

The nsHtml5TreeOpExecutor and XHR seem to cause most of the jank here.  We are still janking during the timeouts, but it starts recovering over that period.
Not sure it's relevant yet, but some of that nsHtml5TreeOpExecutor time is being triggered from the nsHtml5ExecutorReflusher() runnable.  This fires on a repeating timer every 50ms:

https://dxr.mozilla.org/mozilla-central/source/parser/html/nsHtml5TreeOpExecutor.cpp#278

I wonder if we are backing up the main thread with too many of these runnables.
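
A minimal sketch of that concern, in plain JS rather than the actual Gecko C++ timer machinery: if each tick of a repeating 50ms timer does work that costs more than 50ms, the ticks drift and everything else on the main thread waits behind them.

// Minimal sketch only, assuming plain JS timers rather than the Gecko
// nsITimer/runnable machinery: each tick's work costs more than the 50ms
// period, so ticks drift and anything else on the main thread has to wait.
function busyWork(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) { /* simulate a heavy flush */ }
}

let lastTick = Date.now();
setInterval(() => {
  const now = Date.now();
  console.log(`tick after ${now - lastTick}ms`); // drifts well past 50ms
  lastTick = now;
  busyWork(80); // flush cost exceeds the 50ms timer period
}, 50);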
Ah, the timer only runs for background tabs.  Sorry for my confusion.
Flags: needinfo?(bkelly)
My anti-flooding patches help marginally here, but it's still spending a ton of time in JS.  Given all the bailouts in the profile I'll just leave this to nbp to look at.  Sorry I couldn't be of more help.
Ben Kelly, thanks for making the profile on comment 11.

From the profile, I spotted only one repeated bailout (~10x "Bailout X from Y", "Invalidate Y", "Bailout X from Y"). This alone is unlikely to cause all the slow-down. I think the issue is similar to a bug that I already started to investigate (Bug 1300848).

One observation is that we have a lot of diverse bailouts and invalidations, but none that stands out as the source of the slow-down.  I have yet to understand these issues better; in the meantime I am adding blockers to Bug 1300848.  Still, I do not see any such large set of bailouts & invalidations in the original profiles, so this might be a recent regression introduced within the last 6 months, and thus unrelated to the initial issue.

I will continue my investigation of this potential regression (from the last 6 months) in Bug 1300848, as I think these might have identical root causes, and also because I managed to get an offline version to reproduce the issue.

I will keep the need-info to double-check, once Bug 1300848 is fixed, that this is indeed the same issue, but I do not think this is the original issue reported in comment 0.

Is there a way to report Baseline / Ion frames in Cleopatra?!  The cost of a bailout is the time spent in between, not the number of bailouts.  So having a large number of bailout/invalidation is only a hint, not a symptom.
Whiteboard: [platform-rel-Google][platform-rel-GoogleDocs] → [platform-rel-Google][platform-rel-GoogleDocs][platform-rel-GoogleSlides]
Whiteboard: [platform-rel-Google][platform-rel-GoogleDocs][platform-rel-GoogleSlides] → [platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides]
Rank: 5
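For illustration (a hypothetical snippet, not code from the Docs page), the repeated "Bailout X from Y" / "Invalidate Y" pattern described above typically comes from a hot function whose observed types change after Ion has already specialized it:

// Hypothetical example, not from the Google Docs scripts: Ion specializes
// `hot` for int32 after a long warm-up, then an unexpected double forces a
// bailout back to Baseline and, if it keeps happening, an invalidation and
// recompilation of the Ion code.
function hot(x) {
  return x * 2 + 1;
}

let out = 0;
for (let i = 0; i < 1e6; i++) {
  // Mostly int32 inputs, but every so often a double sneaks in after Ion has
  // compiled `hot` under the int32-only assumption.
  out += hot(i % 5000 === 0 ? 0.5 : i);
}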
(In reply to Nicolas B. Pierron [:nbp] from comment #15)
> Is there a way to report Baseline / Ion frames in Cleopatra?!  The cost of a
> bailout is the time spent in between, not the number of bailouts.  So having
> a large number of bailout/invalidation is only a hint, not a symptom.

It turns out that there is, Shu added instrumentation for it a long time ago. It was just getting stripped out when the old Cleopatra processed the profiles, which led me to think that we didn't have it. It isn't all integrated yet, but I have a separate script now that can tell me the percentage of time spent in baseline vs ion for any given script (or even the "timeline" of baseline/ion samples for that script, eg BBBBIIBBBBBIBBBBB is a script that runs in baseline, then runs in ion long enough to get running for two samples before it reverts to baseline, etc.) Does that give you what you would want to know here?

Generally, if we had a "profile linter" that could tell you of suspicious behavior, what could it look for? From your discussion in this bug (from which I learned several things; thank you), it seems like a script with repeated bailouts at the same location indicates a problem in spidermonkey, and if it is accompanied by a high percentage of time spent in baseline, then it is a likely candidate for a fixable performance problem. Is that accurate? When computing the percentage of time spent in baseline vs ion, would it make sense to ignore the initial warmup time spent in baseline? Or more generally, what do you think would be useful to look for?

Is it always a problem to see repeated invalidating bailouts at the same location? I could imagine that if we are discarding ion code during GC, then perhaps that isn't necessarily bad. (Or if it's the same filename:lineno due to the same code running in separate tabs.) I could speculate as to how to properly define "same location", but I'd like to know what information is helpful here in the first place before figuring out how to make it completely accurate.
Depends on: 1329898
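As a sketch of the post-processing described in comment 16 (the timeline string format is the one described there; the function name is made up for illustration):

// Sketch only: given a per-script timeline such as "BBBBIIBBBBBIBBBBB"
// (one letter per profiler sample), report the share of samples in Baseline
// vs Ion and how often the script reverted from Ion back to Baseline.
function tierBreakdown(timeline) {
  let baseline = 0, ion = 0;
  for (const sample of timeline) {
    if (sample === "B") baseline++;
    else if (sample === "I") ion++;
  }
  const total = baseline + ion || 1;
  return {
    baselinePct: (100 * baseline) / total,
    ionPct: (100 * ion) / total,
    revertsToBaseline: (timeline.match(/IB/g) || []).length,
  };
}

// tierBreakdown("BBBBIIBBBBBIBBBBB")
// -> { baselinePct: ~82.4, ionPct: ~17.6, revertsToBaseline: 2 }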
(In reply to Steve Fink [:sfink] [:s:] from comment #16)
> (In reply to Nicolas B. Pierron [:nbp] from comment #15)
> > Is there a way to report Baseline / Ion frames in Cleopatra?!  The cost of a
> > bailout is the time spent in between, not the number of bailouts.  So having
> > a large number of bailout/invalidation is only a hint, not a symptom.
> 
> […] I have a separate script now that can tell me the
> percentage of time spent in baseline vs ion for any given script (or even
> the "timeline" of baseline/ion samples for that script, eg BBBBIIBBBBBIBBBBB
> is a script that runs in baseline, then runs in ion long enough to get
> running for two samples before it reverts to baseline, etc.) Does that give
> you what you would want to know here?

Yes, that would be enough to start hacking some metrics, except that this would miss cases where we bail out immediately after entering Ion.  For that, I guess we can annotate the bailouts with the compilation id (available from the IonScript*), so we can catch an Ion entry even if no sample reports Ion execution.

> Generally, if we had a "profile linter" that could tell you of suspicious
> behavior, what could it look for? From your discussion in this bug (from
> which I learned several things; thank you), it seems like a script with
> repeated bailouts at the same location indicates a problem in spidermonkey,
> and if it is accompanied by a high percentage of time spent in baseline,
> then it is a likely candidate for a fixable performance problem. Is that
> accurate?

This is accurate, but as I noticed in Bug 1300848 these are not necessarily the major source of issues.
For example, consider the following profile:

T¹  ............C¹C¹C¹C¹C¹C¹C¹C¹....................C²C²C²C²C²C²C²C²................
T⁰  B⁰B⁰B⁰B⁰B⁰C¹B⁰B⁰B⁰B⁰B⁰B⁰B⁰B⁰C¹I¹I¹I¹B¹B¹B¹B¹C²B¹B¹B¹B¹B¹B¹B¹B¹B¹C²I²I²B²B²B²B²B²

  where C = compilation time; I = Ion time; B = Baseline time;
        T¹: is the compilation thread;
        T⁰: is the main thread.

We are looking for useless compilations:
    S = (C + I) / I > ~20, for long running (= B + I) scripts.

    where S = speed-up needed to compensate the compilation time.

Otherwise we could be looking for the overall bailout costs:
    2 · (1 - 1 / S) · B

    Note, the normal formula does not use the sum of all compilations & executions, but works on a per-recompilation basis.  Using the sums is not representative of the average, but as we do not know the expected speed-up (= S) yet, we cannot blame single bailouts from a profile, and using the sum might be a good enough approximation.

> When computing the percentage of time spent in baseline vs ion,
> would it make sense to ignore the initial warmup time spent in baseline?

The initial warm-up in baseline might be interesting if the compilation never ends successfully, in which case high baseline time would still highlight issues.

> Or more generally, what do you think would be useful to look for?

(see the above formulae)

> Is it always a problem to see repeated invalidating bailouts at the same
> location? I could imagine that if we are discarding ion code during GC, then
> perhaps that isn't necessarily bad. (Or if it's the same filename:lineno due
> to the same code running in separate tabs.) I could speculate as to how to
> properly define "same location", but I'd like to know what information is
> helpful here in the first place before figuring out how to make it
> completely accurate.

Repeated invalidation bailouts highlight that we failed to collect the information needed to prevent the same thing from happening the next time.
Indeed, a GC would remove tons of information that could have helped avoid such cases, but if a GC causes too many of these repeated invalidation bailouts, then that would still be a problem we would have to fix.
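
As a rough worked sketch of the formulas above, with made-up sample counts (purely illustrative):

// Rough sketch of the metrics above: C, I and B are profiler samples spent
// compiling, running Ion code and running Baseline code for one script.
// Per the note above this really belongs on a per-recompilation basis;
// summing over the whole profile is only an approximation.
function compileWorthMetrics(C, I, B) {
  // S = (C + I) / I : the speed-up Ion must deliver just to pay back the
  // compilation time.  A large S (say > ~20) on a long-running script hints
  // at a useless compilation.
  const S = (C + I) / I;
  // 2 * (1 - 1/S) * B : a coarse estimate of the time lost running in
  // Baseline (e.g. after bailouts) instead of in Ion.
  const bailoutCost = 2 * (1 - 1 / S) * B;
  return { S, bailoutCost };
}

// compileWorthMetrics(30, 2, 100) -> { S: 16, bailoutCost: 187.5 }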
Whiteboard: [platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides] → [qf:p1][platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides]
Hannes, this bug is now a qf:p1. Is nbp still the right person to make sure this moves forward?
Flags: needinfo?(hv1989)
New profile from the Gecko Profiler: https://perfht.ml/2otEqZ6.

Note that comment 0 talks about disabling e10s for historical reasons.  There is nothing non-e10s specific about this bug.
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #19)
> New profile from the Gecko Profiler: https://perfht.ml/2otEqZ6.
> 
> Note that comment 0 talks about disabling e10s for historical reasons. 
> There is nothing non-e10s specific about this bug.

It does not seem the new profile is JS bound, unless I'm missing something. This bug doesn't have anything else concrete in it -- unsetting this as qf:p1.
Whiteboard: [qf:p1][platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides] → [qf][platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides]
(In reply to Shu-yu Guo [:shu] from comment #20)
> (In reply to :Ehsan Akhgari (super long backlog, slow to respond) from
> comment #19)
> > New profile from the Gecko Profiler: https://perfht.ml/2otEqZ6.
> > 
> > Note that comment 0 talks about disabling e10s for historical reasons. 
> > There is nothing non-e10s specific about this bug.
> 
> It does not seem the new profile is JS bound

It is.  GDocs performance in general is JS bound and that's pretty well known.
needinfo Kannan to run his script to see what SpiderMonkey is doing here.
Flags: needinfo?(kvijayan)
Whiteboard: [qf][platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides] → [qf:investigate:p1][platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides]
Numbers for scripts with >1MM ticks (I had to reload the page several times to hit the buffer threshold at which the tick counts were dumped).

Lots of array stuff, mostly, and the usual culprits like _FinishBoundFunctionInit and _IsConstructing:

std_Math_max                => ticks=10460052 count=95994
array_pop                   => ticks=10536018 count=24336
ToInteger                   => ticks=10923164 count=108243
std_Array_unshift           => ticks=11557166 count=501
std_Array_slice             => ticks=13607854 count=1334
ToString                    => ticks=13610192 count=144282
regexp_clone                => ticks=16836900 count=10013
IsArray                     => ticks=18471105 count=288596
array_shift                 => ticks=19968867 count=16535
std_Function_apply          => ticks=20769472 count=1000
NewArrayIterator            => ticks=22165072 count=23576
intl_availableCollations    => ticks=27741953 count=1
str_indexOf                 => ticks=29607368 count=44249
array_indexOf               => ticks=44812005 count=9066
String_substr               => ticks=46137106 count=137569
array_join                  => ticks=56424855 count=16457
String_substring            => ticks=60051429 count=20958
str_toLowerCase             => ticks=67185423 count=47336
array_splice                => ticks=88442465 count=15485
RegExpTester                => ticks=90288815 count=19427
RegExpSearcher              => ticks=90492706 count=26342
ArrayValues                 => ticks=96327671 count=23405
array_unshift               => ticks=189381907 count=76775
RegExpMatcher               => ticks=192140874 count=32647
_IsConstructing             => ticks=234657803 count=297712
_FinishBoundFunctionInit    => ticks=312683955 count=56445
std_Array                   => ticks=408231750 count=183697
StringSplitString           => ticks=438067699 count=78954
array_push                  => ticks=555615494 count=752127
array_sort                  => ticks=563290692 count=77097
String_split                => ticks=628484359 count=79157
array_slice                 => ticks=1820460442 count=1662435
Flags: needinfo?(kvijayan)
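For orientation only (a hypothetical snippet, not code from Google Slides), page JS that leans on string splitting, array copying and sorting in a hot path would light up several of the builtins above:

// Illustrative only, not code from Google Slides: this kind of hot path shows
// up as ticks in String_split/StringSplitString, array_slice, array_push and
// array_sort in the list above.
function processRows(rows) {
  let flat = [];
  for (const row of rows) {
    const cells = row.split("\t");   // String_split / StringSplitString
    flat.push(...cells.slice(1));    // array_slice + array_push
  }
  return flat.sort();                // array_sort
}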
I added instrumentation for FunctionBind (the self-hosted function that uses _FinishBoundFunctionInit), and re-ran.  That's high up there in the tick count too:

FunctionBind                => ticks=457743853 count=49300
Whiteboard: [qf:investigate:p1][platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides] → [qf:p3][platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides]
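Purely illustrative (not taken from the Slides code): FunctionBind and _FinishBoundFunctionInit ranking this high usually points at bound functions being created in a hot path, e.g. a fresh .bind() per item instead of binding once:

// Hypothetical example: each iteration creates a new bound function, and every
// one of those goes through FunctionBind / _FinishBoundFunctionInit.
class Widget {
  constructor(items) { this.items = items; }
  renderOne(item) { return "<li>" + item + "</li>"; }
  renderAll() {
    const out = [];
    for (const item of this.items) {
      const render = this.renderOne.bind(this);  // a fresh bound function per item
      out.push(render(item));
    }
    return out;
  }
}

// Hoisting the bind out of the loop (or calling this.renderOne(item) directly)
// avoids the per-item FunctionBind cost.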
(already answered in comment 10)
Answering myself from comment 15: Bug 1300848 is fixed, probably because of the CacheIR work, so this is unlikely to be the same issue.
Flags: needinfo?(nicolas.b.pierron)
Performance Impact: --- → P3
Whiteboard: [qf:p3][platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides] → [platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSlides]
Severity: normal → S3

Clear a needinfo that is pending on an inactive user.

Inactive users most likely will not respond; if the missing information is essential and cannot be collected another way, the bug should perhaps be closed as INCOMPLETE.

For more information, please visit the auto_nag documentation.

Flags: needinfo?(hv1989)

I can't reproduce this. A lot has changed since 2015 so this was likely fixed on our side and/or by Google Slides changes.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME